Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (11): 2418-2429    DOI: 10.3785/j.issn.1008-973X.2025.11.021
    
Single object tracking algorithm based on spatio-temporal feature enhancement
Lei GU(),Nan XIA*(),Jiahong JIANG,Xiaoyu LIAN
School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, China

Abstract  

A single object tracking algorithm based on spatio-temporal feature enhancement (OSTrack-ST), built on the one-stream tracking network OSTrack, was proposed to address the common problems of occlusion and scale variation in complex motion scenes and to improve how single object tracking algorithms exploit temporal feature information and express object spatial features. For spatial feature enhancement, a multi-head spatial association attention mechanism, comprising spatial attention and multi-head context association attention, was proposed to strengthen the model's expression of global and local spatial features and to effectively improve its ability to capture object features in dynamic environments. For spatio-temporal feature enhancement, a spatio-temporal template update strategy based on temporal drift prediction was proposed, which used spatial position prediction results to control template updates over time, enhancing the robustness and accuracy of the model in long-term sequential tasks. Experimental results demonstrated that the proposed algorithm achieved tracking success rates of 70.5%, 73.7% and 68.7% on the LaSOT, GOT-10k and SportsSOT datasets, respectively, at a running speed of over 49 frames per second. The overall performance of the algorithm was better than that of other tracking algorithms such as EVPTrack.
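The abstract describes the spatial branch as spatial attention followed by multi-head context association attention. The paper's exact formulation is not given on this page; the NumPy sketch below only illustrates the general pattern (a per-token spatial gate applied before multi-head self-attention). Function names, the sigmoid gating form, and the identity head projections are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat):
    # feat: (N, C) token features. Gate each token by a saliency score
    # built from channel-wise average and max pooling (assumed form).
    avg = feat.mean(axis=1, keepdims=True)        # (N, 1)
    mx = feat.max(axis=1, keepdims=True)          # (N, 1)
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))      # sigmoid gate per token
    return feat * gate

def multi_head_context_attention(feat, num_heads=4):
    # Multi-head self-attention over tokens; each head attends over all
    # token positions, modeling context association between regions.
    n, c = feat.shape
    d = c // num_heads
    out = np.empty_like(feat)
    for h in range(num_heads):
        q = k = v = feat[:, h * d:(h + 1) * d]    # identity projections (sketch)
        attn = softmax(q @ k.T / np.sqrt(d))      # (N, N) association weights
        out[:, h * d:(h + 1) * d] = attn @ v
    return out

tokens = np.random.rand(16, 32)                   # 16 tokens, 32 channels
enhanced = multi_head_context_attention(spatial_attention(tokens))
```

In a real tracker the projections would be learned linear layers and the tokens would come from the one-stream backbone; the sketch only shows how the two attention stages compose.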



Key words: object tracking; Siamese network; spatio-temporal enhancement; attention mechanism; template update
Received: 11 January 2025      Published: 30 October 2025
CLC:  TP 391  
Fund: Industry-University Cooperative Education Program of the Ministry of Education (220603231024713).
Corresponding Authors: Nan XIA     E-mail: 220520854000562@xy.dlpu.edu.cn;xianan@dlpu.edu.cn
Cite this article:

Lei GU,Nan XIA,Jiahong JIANG,Xiaoyu LIAN. Single object tracking algorithm based on spatio-temporal feature enhancement. Journal of ZheJiang University (Engineering Science), 2025, 59(11): 2418-2429.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.11.021     OR     https://www.zjujournals.com/eng/Y2025/V59/I11/2418


Fig.1 Overall network architecture of single object tracking algorithm based on spatio-temporal feature enhancement
Fig.2 Structural diagram of spatial attention module
Fig.3 Structural diagram of multi-head context association attention module
Fig.4 Structural diagram of spatio-temporal template update strategy
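The spatio-temporal template update strategy (Fig.4) gates template updates with a temporal drift prediction: the spatial position predicted from recent motion is compared against the new detection, and the template is refreshed only when the two agree. The paper's exact predictor and thresholds are not given on this page; the sketch below assumes a simple linear motion model, a size-normalized drift threshold, and a confidence gate, all of which are illustrative choices.

```python
import numpy as np

def center(box):
    # box: (x, y, w, h) in pixels; return the center point.
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def should_update_template(history, new_box, drift_thresh=0.5,
                           conf=0.9, conf_thresh=0.6):
    """Accept a new template only when the new position is consistent with
    recent motion (small drift) and the tracker is confident (assumed gates)."""
    if conf < conf_thresh:
        return False
    # Linear motion prediction from the last two boxes.
    c_prev, c_last = center(history[-2]), center(history[-1])
    predicted = c_last + (c_last - c_prev)
    drift = np.linalg.norm(center(new_box) - predicted)
    scale = max(new_box[2], new_box[3])            # normalize by object size
    return bool(drift / scale <= drift_thresh)

history = [(10, 10, 20, 20), (14, 10, 20, 20)]     # steady rightward motion
print(should_update_template(history, (18, 10, 20, 20)))   # consistent: True
print(should_update_template(history, (60, 40, 20, 20)))   # large drift: False
```

Rejecting updates under large drift keeps a corrupted appearance (e.g. during occlusion) out of the template, which is the robustness mechanism the abstract attributes to the strategy.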
| Algorithm | LaSOT AUC/% | LaSOT Pnorm/% | LaSOT P/% | GOT-10k AO/% | GOT-10k SR50/% | GOT-10k SR75/% |
| --- | --- | --- | --- | --- | --- | --- |
| SiamRPN[8] | 43.7 | 47.8 | 41.5 | 40.8 | 46.4 | 19.8 |
| TAN[14] | 47.4 | 53.5 | 45.3 | — | — | — |
| SiamFC++[9] | 54.1 | 62.2 | 54.6 | 59.4 | 69.3 | 47.1 |
| TransT[16] | 64.8 | 73.7 | 69.0 | 66.9 | 76.7 | 60.8 |
| STARK[17] | 67.2 | 76.9 | — | 68.9 | 78.0 | 64.2 |
| KeepTrack[13] | 67.3 | 77.4 | 70.4 | 68.3 | 79.3 | 61.0 |
| MixViT[23] | 68.7 | 78.4 | 74.3 | 70.4 | 79.8 | 67.9 |
| OSTrack[26] | 69.0 | 78.6 | 75.1 | 70.9 | 80.2 | 68.2 |
| SwinTrack[21] | 69.2 | 78.3 | 74.0 | 69.4 | 78.0 | 64.3 |
| AiATrack[18] | 69.3 | 79.2 | 73.5 | 69.9 | 80.1 | 63.5 |
| FEHST[22] | 70.1 | 78.8 | 75.1 | 71.4 | 81.7 | 68.3 |
| LGTrack[24] | 70.2 | 80.4 | 76.4 | 72.4 | 82.3 | 69.6 |
| EVPTrack[25] | 70.4 | 80.6 | 77.4 | 73.3 | 83.6 | 70.7 |
| OSTrack-ST | 70.5 | 80.8 | 77.1 | 73.7 | 84.1 | 71.4 |
Tab.1 Comparison of tracking results of different algorithms on LaSOT and GOT-10k test sets
| Algorithm | AUC/% | Pnorm/% | P/% |
| --- | --- | --- | --- |
| MixViT | 66.4 | 73.2 | 69.0 |
| OSTrack | 66.1 | 73.7 | 69.4 |
| FEHST | 67.0 | 74.3 | 70.9 |
| LGTrack | 67.1 | 74.5 | 71.7 |
| EVPTrack | 68.4 | 75.9 | 73.5 |
| OSTrack-ST | 68.7 | 76.2 | 73.7 |
Tab.2 Comparison of tracking results of different algorithms on SportsSOT test set
| Model | Resolution (template, search) | AUC/% | FPS/(frame·s⁻¹) | FLOPs/10⁹ | Np/10⁶ |
| --- | --- | --- | --- | --- | --- |
| TransT | (128, 256) | 64.8 | 45.7 | 16.7 | 23.0 |
| STARK | (128, 320) | 67.2 | 43.6 | 18.5 | 47.2 |
| MixViT | (128, 288)/(192, 384) | 68.7/71.9 | 50.3/10.8 | 20.9/113.1 | 97.4/195.4 |
| OSTrack | (128, 256)/(192, 384) | 69.0/71.2 | 101.3/44.6 | 21.5/48.2 | 92.7/92.7 |
| LGTrack | (128, 256)/(192, 384) | 70.2/71.4 | 29.4/16.5 | 39.2/92.7 | 87.9/87.9 |
| EVPTrack | (128, 256)/(192, 384) | 70.4/72.3 | 70.7/27.6 | 35.7/69.1 | 73.7/73.7 |
| OSTrack-ST | (128, 256)/(192, 384) | 70.5/72.6 | 49.8/18.9 | 32.4/74.5 | 98.3/98.3 |
Tab.3 Comparison of real-time efficiency between proposed and other tracking algorithms on LaSOT benchmark dataset
Fig.5 AUC performance of different algorithms in scenarios with different attributes of LaSOT dataset
| Algorithm | Strategy (1) | Strategy (2) | LaSOT AUC/% | LaSOT Pnorm/% | LaSOT P/% | GOT-10k AO/% | GOT-10k SR50/% | GOT-10k SR75/% | SportsSOT AUC/% | SportsSOT Pnorm/% | SportsSOT P/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MixViT | × | × | 68.7 | 78.4 | 74.3 | 70.4 | 79.8 | 67.9 | 66.4 | 73.2 | 69.0 |
| MixViT | √ | × | 69.5 | 79.2 | 74.8 | 71.2 | 81.4 | 68.7 | 66.7 | 73.3 | 69.5 |
| MixViT | × | √ | 69.4 | 79.6 | 75.4 | 70.9 | 82.1 | 67.9 | 67.4 | 74.4 | 72.3 |
| MixViT | √ | √ | 70.2 | 80.0 | 76.5 | 73.7 | 83.6 | 70.4 | 67.8 | 74.9 | 72.7 |
| OSTrack | × | × | 69.0 | 78.6 | 75.1 | 70.9 | 80.2 | 68.2 | 66.1 | 73.7 | 69.4 |
| OSTrack | √ | × | 69.6 | 79.1 | 76.0 | 72.2 | 82.0 | 69.8 | 67.4 | 74.9 | 71.8 |
| OSTrack | × | √ | 70.2 | 79.7 | 76.2 | 72.9 | 81.7 | 70.4 | 67.9 | 75.2 | 72.9 |
| OSTrack-ST | √ | √ | 70.5 | 80.8 | 77.2 | 73.7 | 84.1 | 71.4 | 68.7 | 76.2 | 73.7 |
Tab.4 Results of ablation experiments on MixViT and OSTrack algorithms
Fig.6 Comparison of feature maps of MixViT and OSTrack adopting different improvement strategies
Fig.7 Comparison of tracking results of different algorithms in complex scenarios
[1]   JAVED S, DANELLJAN M, KHAN F S, et al Visual object tracking with discriminative filters and Siamese networks: a survey and outlook[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45 (5): 6552- 6574
[2]   SUN Xunhong, DU Haibo, CHEN Weile, et al Finite-time target tracking control based on mobile robot’s onboard PanTilt-Zoom camera system[J]. Control and Decision, 2023, 38 (10): 2875- 2880
[3]   JIANG Jiahong, XIA Nan, LI Changwu, et al Keypoint detection method for single person gymnastics actions based on multi-scale incremental learning[J]. Acta Electronica Sinica, 2024, 52 (5): 1730- 1742
[4]   MARVASTI-ZADEH S M, CHENG L, GHANEI-YAKHDAN H, et al Deep learning for visual tracking: a comprehensive survey[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23 (5): 3943- 3968
doi: 10.1109/TITS.2020.3046478
[5]   LU Huchuan, LI Peixia, WANG Dong Visual object tracking: a survey[J]. Pattern Recognition and Artificial Intelligence, 2018, 31 (1): 61- 76
[6]   DU S, WANG S An overview of correlation-filter-based object tracking[J]. IEEE Transactions on Computational Social Systems, 2022, 9 (1): 18- 31
doi: 10.1109/TCSS.2021.3093298
[7]   ZHANG Jinpu, WANG Yuehuan A survey of Siamese networks tracking algorithm integrating detection technology[J]. Infrared and Laser Engineering, 2022, 51 (10): 1- 14
[8]   LI B, YAN J, WU W, et al. High performance visual tracking with Siamese region proposal network [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 8971–8980.
[9]   XU Y, WANG Z, LI Z, et al. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020: 12549–12556.
[10]   CHEN D, TANG F, DONG W, et al SiamCPN: visual tracking with the Siamese center-prediction network[J]. Computational Visual Media, 2021, 7 (2): 253- 265
doi: 10.1007/s41095-021-0212-1
[11]   ZHANG L, GONZALEZ-GARCIA A, VAN DE WEIJER J, et al. Learning the model update for Siamese trackers [C]// IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4009–4018.
[12]   SARIBAS H, CEVIKALP H, KÖPÜKLÜ O, et al TRAT: tracking by attention using spatio-temporal features[J]. Neurocomputing, 2022, 492 (1): 150- 161
[13]   MAYER C, DANELLJAN M, PANI PAUDEL D, et al. Learning target candidate association to keep track of what not to track [C]// IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 13424–13434.
[14]   WANG Mengmeng, YANG Xiaoqian, LIU Yong A spatio-temporal encoded network for single object tracking[J]. Journal of Image and Graphics, 2022, 27 (9): 2733- 2748
doi: 10.11834/jig.211157
[15]   HAN K, WANG Y, CHEN H, et al A survey on vision Transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45 (1): 87- 110
doi: 10.1109/TPAMI.2022.3152247
[16]   CHEN X, YAN B, ZHU J, et al. Transformer tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 8122–8131.
[17]   YAN B, PENG H, FU J, et al. Learning spatio-temporal Transformer for visual tracking [C]// IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10428–10437.
[18]   GAO S, ZHOU C, MA C, et al. AiATrack: attention in attention for Transformer visual tracking [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 146–164.
[19]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[20]   WANG N, ZHOU W, WANG J, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 1571–1580.
[21]   LING L T, FAN H, ZHANG Z P, et al. SwinTrack: a simple and strong baseline for Transformer tracking [C]// Conference on Neural Information Processing Systems. New Orleans: [s. n.], 2022: 16743–16754.
[22]   HOU Zhiqiang, YANG Xiaolin, MA Sugang, et al Feature enhancement and history frame selection based Transformer visual tracking[J]. Control and Decision, 2024, 39 (10): 3506- 3512
[23]   CUI Y, JIANG C, WU G, et al MixFormer: end-to-end tracking with iterative mixed attention[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46 (6): 4129- 4146
doi: 10.1109/TPAMI.2024.3349519
[24]   LIU C, ZHAO J, BO C, et al LGTrack: exploiting local and global properties for robust visual tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34 (9): 8161- 8171
doi: 10.1109/TCSVT.2024.3390054
[25]   SHI L, ZHONG B, LIANG Q, et al. Explicit visual prompts for visual object tracking [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 4838–4846.
[26]   YE B, CHANG H, MA B, et al. Joint feature learning and relation modeling for tracking: a one-stream framework [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 341–357.
[27]   WANG Y, DENG L, ZHENG Y, et al Temporal convolutional network with soft thresholding and attention mechanism for machinery prognostics[J]. Journal of Manufacturing Systems, 2021, 60 (1): 512- 526
[28]   FAN H, LIN L T, YANG F, et al. LaSOT: a high-quality benchmark for large-scale single object tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5369–5378.
[29]   HUANG L, ZHAO X, HUANG K GOT-10k: a large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43 (5): 1562- 1577
doi: 10.1109/TPAMI.2019.2957464
[30]   CUI Y, ZENG C, ZHAO X, et al. SportsMOT: a large multi-object tracking dataset in multiple sports scenes [C]// IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 9887–9897.
[31]   HE K, CHEN X, XIE S, et al. Masked autoencoders are scalable vision learners [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 15979–15988.