Object detection algorithm based on multi-azimuth perception deep fusion detection head

doi:10.3785/j.issn.1008-973X.2026.01.003

Journal of ZheJiang University (Engineering Science)

2026, Vol. 60, Issue (1): 32-42

Xiao’an BAO1, Shuyou PENG1, Na ZHANG1, Xiaomei TU2, Qingqi ZHANG3, Biao WU4,*

1. School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
2. School of Civil Engineering and Architecture, Zhejiang Guangsha Vocational and Technical University of Construction, Dongyang 322100, China
3. Graduate School of East Asian Studies, Yamaguchi University, Yamaguchi 753-8514, Japan
4. School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China


Abstract

An object detection algorithm based on a multi-azimuth perception deep fusion detection head was proposed to address the difficulty that traditional detection heads have in capturing global information. An efficient dual-axial-window attention encoder (EDWE) module was designed in the detection head so that the network can deeply fuse the captured global and local information. A reparameterized large kernel convolution (RLK) module was placed after the feature pyramid structure to reduce the feature-space discrepancy inherited from the backbone network and to improve the network's adaptability to small and medium-sized datasets. An encoder selective-save module (ESM) was introduced to selectively accumulate the outputs of the EDWE modules and to optimize backpropagation. Experimental results showed that on the larger-scale MS-COCO2017 dataset, applying the proposed algorithm to the widely used detectors RetinaNet, FCOS, and ATSS improved AP by 2.9, 2.6, and 3.4 percentage points, respectively. On the smaller-scale PASCAL VOC2007 dataset, the AP of the same three detectors improved by 1.3, 1.0, and 1.1 percentage points, respectively. Through the combined effect of the EDWE, RLK, and ESM modules, the proposed algorithm improves object detection accuracy and shows significant performance advantages on datasets of different scales.
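The abstract does not describe how the RLK module is reparameterized. As a rough illustration only, the sketch below shows the general structural reparameterization idea commonly used with large-kernel convolutions: a parallel small-kernel branch is folded into the large kernel by centre-aligned addition, so that two training-time branches collapse into one inference-time convolution. The kernel sizes (7 and 3), the single-channel setting, the omission of batch normalization, and all function names are assumptions made for this example and are not taken from the paper.

```python
import random


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    height, width = len(image), len(image[0])
    k = len(kernel)  # kernel is k x k with k odd
    pad = k // 2
    output = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    # Positions outside the image count as zeros.
                    if 0 <= y < height and 0 <= x < width:
                        total += image[y][x] * kernel[di][dj]
            output[i][j] = total
    return output


def merge_kernels(large, small):
    """Fold a small kernel into a copy of a large one by centre-aligned addition."""
    offset = (len(large) - len(small)) // 2
    merged = [row[:] for row in large]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            merged[i + offset][j + offset] += value
    return merged


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = random_matrix(9, 9, rng)
large_kernel = random_matrix(7, 7, rng)  # hypothetical large-kernel branch
small_kernel = random_matrix(3, 3, rng)  # hypothetical parallel small-kernel branch

# Training-time form: two parallel branches whose outputs are summed.
out_large = conv2d_same(image, large_kernel)
out_small = conv2d_same(image, small_kernel)
two_branch = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out_large, out_small)]

# Inference-time form: a single convolution with the merged kernel.
single_branch = conv2d_same(image, merge_kernels(large_kernel, small_kernel))

max_abs_diff = max(
    abs(a - b)
    for r1, r2 in zip(two_branch, single_branch)
    for a, b in zip(r1, r2)
)
print(f"max |two_branch - single_branch| = {max_abs_diff:.2e}")
```

The two forms agree up to floating-point rounding because convolution is linear in the kernel. In a real network the same merge would be applied per channel, typically after folding any normalization layers into the kernel weights and biases.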

Key words: detection head; object detection; Transformer encoder; deep fusion; large kernel convolution

Received: 11 December 2024 Published: 15 December 2025

CLC: TP 391

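Fund: National Natural Science Foundation of China (6207050141); Zhejiang Provincial Key Research and Development Program (2020C03094); General Scientific Research Project of the Zhejiang Provincial Department of Education (Y202147659); Zhejiang Provincial Department of Education Projects (Y202250706, Y202250677); Zhejiang Provincial Basic Public Welfare Research Program (QY19E050003).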

Corresponding Authors: Biao WU. E-mail: baoxiaoan@zstu.edu.cn; biaowuzg@zstu.edu.cn


Cite this article:

Xiao’an BAO, Shuyou PENG, Na ZHANG, Xiaomei TU, Qingqi ZHANG, Biao WU. Object detection algorithm based on multi-azimuth perception deep fusion detection head. Journal of ZheJiang University (Engineering Science), 2026, 60(1): 32-42.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.01.003 OR https://www.zjujournals.com/eng/Y2026/V60/I1/32


Fig.1 Network structure of object detection algorithm based on multi-azimuth perception deep fusion detection head

Fig.2 Schematic diagram of efficient dual-axial attention module

Fig.3 Structure of efficient dual-axial-window attention encoder (EDWE) module

Fig.4 Visualization heatmaps of feature-focused regions generated by different attention modules

Fig.5 Structure of reparameterized large kernel convolution (RLK) module

Fig.6 Phenomenon of incorrect predictions occurring in later encoding stages

Fig.7 Schematic diagrams of original module, encoder dense-save module (EDM) and encoder selective-save module (ESM)

Tab.1 Results of applying MdfHead to different object detectors on MS-COCO2017 dataset

Tab.2 Ablation experiment of different modules in MdfHead

Tab.3 Comparison experiment with different encoder modules

Tab.4 Ablation experiment on the number of EDWE modules

Tab.5 Experiment on selection of convolution kernel sizes in RLK module

Tab.6 Experiment on effectiveness of RLK module on smaller dataset

Tab.7 Ablation experiment on ESM

Fig.8 Experimental results of visualization comparison between MdfHead and original detection head

Fig.9 Detection accuracy of MdfHead and original detection head on different species in wildlife dataset

Tab.8 Comparison experiment of different detection heads on wildlife dataset



