High-precision real-time semantic segmentation network for UAV remote sensing images

Journal of ZheJiang University (Engineering Science), 2025, Vol. 59, Issue 7: 1411-1420. DOI: 10.3785/j.issn.1008-973X.2025.07.009


Xinyu WEI¹, Lei RAO¹*, Guangyu FAN¹, Niansheng CHEN¹, Songlin CHENG¹, Dingyu YANG²

1. School of Electronic Information Engineering, Shanghai Dianji University, Shanghai 201306, China
2. State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou 310058, China


Abstract

A shared shallow feature network (SSFNet) was proposed to address the low inference efficiency and poor segmentation quality of semantic segmentation models for UAV images. The detail branch shared the 1/4 and 1/8 downsampling stages of the semantic branch, which simplified the downsampling path of the detail branch and improved inference efficiency. In the semantic branch, stacked connections were combined with a channel decomposition mechanism to construct an efficient receptive field block (ERFB), which strengthened multi-scale feature extraction at almost no additional inference cost. To integrate contextual information within the semantic branch, a fast aggregation of context (FAC) module was proposed, in which a gating mechanism controlled how features from the 1/16 and 1/32 downsampling stages supplemented the semantic information of the final stage. In the decoding stage, a bilateral fusion module (BFM) was constructed with a hybrid activation function to fully fuse detail and semantic information. Results show that SSFNet achieves mean intersection over union (mIoU) scores of 68.5%, 52.7%, and 87.1% on the UAVid, LoveDA, and Potsdam datasets, respectively, and reaches an inference speed of 131.1 frames per second at a 1024×1024 input resolution on an NVIDIA RTX 3090 GPU, demonstrating good real-time segmentation performance.

Key words: real-time semantic segmentation; UAV images; remote sensing images; convolutional neural networks; multi-scale features
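The segmentation results above are reported as mean intersection over union (mIoU). As a minimal sketch, assuming the common definition in which each class's IoU is TP / (TP + FP + FN) computed from a pixel-level confusion matrix and the per-class values are then averaged, the metric can be computed as follows. The function names are hypothetical, and the paper's exact evaluation protocol (for example, ignored labels or classes excluded per dataset) is not stated in this abstract and may differ.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average of per-class IoU = TP / (TP + FP + FN).

    Classes absent from both the ground truth and the prediction (denominator 0)
    are skipped; this is an assumption, not necessarily the paper's protocol.
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class-c pixels predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # other pixels wrongly predicted as class c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy 3-class example with six flattened "pixels".
    cm = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
    print(cm)                        # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
    print(round(mean_iou(cm), 4))    # per-class IoU 1/3, 2/3, 1/2 -> 0.5
```

In practice the confusion matrix is accumulated over every pixel of the whole test set before the per-class IoUs are averaged, rather than averaging per-image scores.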

Received: 25 June 2024 Published: 25 July 2025

CLC: TP 391.41

Fund: National Natural Science Foundation of China (61702320); Shanghai Chenguang Program (15CG62).

Corresponding Authors: Lei RAO. E-mail: 226003010214@sdju.edu.cn; raol@sdju.edu.cn


Cite this article:

Xinyu WEI, Lei RAO, Guangyu FAN, Niansheng CHEN, Songlin CHENG, Dingyu YANG. High-precision real-time semantic segmentation network for UAV remote sensing images. Journal of ZheJiang University (Engineering Science), 2025, 59(7): 1411-1420.

URL: https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.07.009 or https://www.zjujournals.com/eng/Y2025/V59/I7/1411


Fig.1 Comparison of different semantic segmentation network architectures

Fig.2 Overall architecture of shared shallow feature network

Fig.3 Image processing process for different models

Fig.4 Comparison of visual feature maps between bilateral fusion module and addition

Tab.1 Experimental results of module ablation on different validation sets

Tab.2 Ablation experiments of various branches within bilateral fusion module

Tab.3 Ablation experimental results of receptive field block on UAVid validation set

Tab.4 Comparison of segmentation algorithms on parameters and inference speed

Fig.5 Visual comparison of image segmentation results of different algorithms on UAVid dataset

Tab.5 Performance comparison of segmentation algorithms on UAVid test set (%)

Fig.6 Visual comparison of image segmentation results of different algorithms on LoveDA dataset

Tab.6 Performance comparison of segmentation algorithms on LoveDA test set (%)

Fig.7 Visual comparison of image segmentation results of different algorithms on Potsdam dataset

Tab.7 Performance comparison of segmentation algorithms on Potsdam test set (%)

Tab.8 Inference speed of shared shallow feature network across different hardware environments



