Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (11): 2418-2429    DOI: 10.3785/j.issn.1008-973X.2025.11.021
    
Single object tracking algorithm based on spatio-temporal feature enhancement
Lei GU(),Nan XIA*(),Jiahong JIANG,Xiaoyu LIAN
School of Information Science and Engineering, Dalian Polytechnic University, Dalian 116034, China

Abstract  

A single object tracking algorithm based on spatio-temporal feature enhancement (OSTrack-ST), built on the one-stream tracking network OSTrack, was proposed to address the common problems of occlusion and scale variation in complex motion scenes and to improve how single object tracking algorithms exploit temporal feature information and express object spatial features. For spatial feature enhancement, a multi-head spatial association attention mechanism, comprising spatial attention and multi-head context association attention, was proposed to strengthen the model's expression of global and local spatial features and to effectively improve its ability to capture object features in dynamic environments. For spatio-temporal feature enhancement, a spatio-temporal template update strategy based on temporal drift prediction was proposed, which used spatial position prediction results to control template updates over time, enhancing the robustness and accuracy of the model in long-term sequential tasks. Experimental results demonstrated that the proposed algorithm achieved tracking success rates of 70.5%, 73.7% and 68.7% on the LaSOT, GOT-10k and SportsSOT datasets, respectively, at a running speed of over 49 frames per second. The overall performance of the algorithm was better than that of other tracking algorithms such as EVPTrack.
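The abstract describes the spatial branch as spatial attention followed by multi-head context association attention. The paper's exact formulation is not given on this page; the NumPy sketch below only illustrates the general pattern (a per-token spatial gate applied before multi-head self-attention). Function names, the sigmoid gating form, and the identity head projections are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat):
    # feat: (N, C) token features. Gate each token by a saliency score
    # built from channel-wise average and max pooling (assumed form).
    avg = feat.mean(axis=1, keepdims=True)        # (N, 1)
    mx = feat.max(axis=1, keepdims=True)          # (N, 1)
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))      # sigmoid gate per token
    return feat * gate

def multi_head_context_attention(feat, num_heads=4):
    # Multi-head self-attention over tokens; each head attends over all
    # token positions, modeling context association between regions.
    n, c = feat.shape
    d = c // num_heads
    out = np.empty_like(feat)
    for h in range(num_heads):
        q = k = v = feat[:, h * d:(h + 1) * d]    # identity projections (sketch)
        attn = softmax(q @ k.T / np.sqrt(d))      # (N, N) association weights
        out[:, h * d:(h + 1) * d] = attn @ v
    return out

tokens = np.random.rand(16, 32)                   # 16 tokens, 32 channels
enhanced = multi_head_context_attention(spatial_attention(tokens))
```

In a real tracker the projections would be learned linear layers and the tokens would come from the one-stream backbone; the sketch only shows how the two attention stages compose.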



Key words: object tracking; Siamese network; spatio-temporal enhancement; attention mechanism; template update
Received: 11 January 2025      Published: 30 October 2025
CLC:  TP 391  
Fund: Industry-University Cooperative Education Program of the Ministry of Education (220603231024713).
Corresponding Authors: Nan XIA     E-mail: 220520854000562@xy.dlpu.edu.cn;xianan@dlpu.edu.cn
Cite this article:

Lei GU,Nan XIA,Jiahong JIANG,Xiaoyu LIAN. Single object tracking algorithm based on spatio-temporal feature enhancement. Journal of ZheJiang University (Engineering Science), 2025, 59(11): 2418-2429.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.11.021     OR     https://www.zjujournals.com/eng/Y2025/V59/I11/2418


Fig.1 Overall network architecture of single object tracking algorithm based on spatio-temporal feature enhancement
Fig.2 Structural diagram of spatial attention module
Fig.3 Structural diagram of multi-head context association attention module
Fig.4 Structural diagram of spatio-temporal template update strategy
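The spatio-temporal template update strategy (Fig.4) gates template updates with a temporal drift prediction: the spatial position predicted from recent motion is compared against the new detection, and the template is refreshed only when the two agree. The paper's exact predictor and thresholds are not given on this page; the sketch below assumes a simple linear motion model, a size-normalized drift threshold, and a confidence gate, all of which are illustrative choices.

```python
import numpy as np

def center(box):
    # box: (x, y, w, h) in pixels; return the center point.
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def should_update_template(history, new_box, drift_thresh=0.5,
                           conf=0.9, conf_thresh=0.6):
    """Accept a new template only when the new position is consistent with
    recent motion (small drift) and the tracker is confident (assumed gates)."""
    if conf < conf_thresh:
        return False
    # Linear motion prediction from the last two boxes.
    c_prev, c_last = center(history[-2]), center(history[-1])
    predicted = c_last + (c_last - c_prev)
    drift = np.linalg.norm(center(new_box) - predicted)
    scale = max(new_box[2], new_box[3])            # normalize by object size
    return bool(drift / scale <= drift_thresh)

history = [(10, 10, 20, 20), (14, 10, 20, 20)]     # steady rightward motion
print(should_update_template(history, (18, 10, 20, 20)))   # consistent: True
print(should_update_template(history, (60, 40, 20, 20)))   # large drift: False
```

Rejecting updates under large drift keeps a corrupted appearance (e.g. during occlusion) out of the template, which is the robustness mechanism the abstract attributes to the strategy.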
| Algorithm | LaSOT AUC/% | LaSOT Pnorm/% | LaSOT P/% | GOT-10k AO/% | GOT-10k SR50/% | GOT-10k SR75/% |
| --- | --- | --- | --- | --- | --- | --- |
| SiamRPN[8] | 43.7 | 47.8 | 41.5 | 40.8 | 46.4 | 19.8 |
| TAN[14] | 47.4 | 53.5 | 45.3 | — | — | — |
| SiamFC++[9] | 54.1 | 62.2 | 54.6 | 59.4 | 69.3 | 47.1 |
| TransT[16] | 64.8 | 73.7 | 69.0 | 66.9 | 76.7 | 60.8 |
| STARK[17] | 67.2 | 76.9 | — | 68.9 | 78.0 | 64.2 |
| KeepTrack[13] | 67.3 | 77.4 | 70.4 | 68.3 | 79.3 | 61.0 |
| MixViT[23] | 68.7 | 78.4 | 74.3 | 70.4 | 79.8 | 67.9 |
| OSTrack[26] | 69.0 | 78.6 | 75.1 | 70.9 | 80.2 | 68.2 |
| SwinTrack[21] | 69.2 | 78.3 | 74.0 | 69.4 | 78.0 | 64.3 |
| AiATrack[18] | 69.3 | 79.2 | 73.5 | 69.9 | 80.1 | 63.5 |
| FEHST[22] | 70.1 | 78.8 | 75.1 | 71.4 | 81.7 | 68.3 |
| LGTrack[24] | 70.2 | 80.4 | 76.4 | 72.4 | 82.3 | 69.6 |
| EVPTrack[25] | 70.4 | 80.6 | 77.4 | 73.3 | 83.6 | 70.7 |
| OSTrack-ST | 70.5 | 80.8 | 77.1 | 73.7 | 84.1 | 71.4 |
Tab.1 Comparison of tracking results of different algorithms on LaSOT and GOT-10k test sets
| Algorithm | AUC/% | Pnorm/% | P/% |
| --- | --- | --- | --- |
| MixViT | 66.4 | 73.2 | 69.0 |
| OSTrack | 66.1 | 73.7 | 69.4 |
| FEHST | 67.0 | 74.3 | 70.9 |
| LGTrack | 67.1 | 74.5 | 71.7 |
| EVPTrack | 68.4 | 75.9 | 73.5 |
| OSTrack-ST | 68.7 | 76.2 | 73.7 |
Tab.2 Comparison of tracking results of different algorithms on SportsSOT test set
| Model | Resolution (template, search) | AUC/% | FPS/(frame·s⁻¹) | FLOPs/10⁹ | Np/10⁶ |
| --- | --- | --- | --- | --- | --- |
| TransT | (128, 256) | 64.8 | 45.7 | 16.7 | 23.0 |
| STARK | (128, 320) | 67.2 | 43.6 | 18.5 | 47.2 |
| MixViT | (128, 288)/(192, 384) | 68.7/71.9 | 50.3/10.8 | 20.9/113.1 | 97.4/195.4 |
| OSTrack | (128, 256)/(192, 384) | 69.0/71.2 | 101.3/44.6 | 21.5/48.2 | 92.7/92.7 |
| LGTrack | (128, 256)/(192, 384) | 70.2/71.4 | 29.4/16.5 | 39.2/92.7 | 87.9/87.9 |
| EVPTrack | (128, 256)/(192, 384) | 70.4/72.3 | 70.7/27.6 | 35.7/69.1 | 73.7/73.7 |
| OSTrack-ST | (128, 256)/(192, 384) | 70.5/72.6 | 49.8/18.9 | 32.4/74.5 | 98.3/98.3 |
Tab.3 Comparison of real-time efficiency between proposed and other tracking algorithms on LaSOT benchmark dataset
Fig.5 AUC performance of different algorithms in scenarios with different attributes of LaSOT dataset
| Algorithm | Strategy (1) | Strategy (2) | LaSOT AUC/% | LaSOT Pnorm/% | LaSOT P/% | GOT-10k AO/% | GOT-10k SR50/% | GOT-10k SR75/% | SportsSOT AUC/% | SportsSOT Pnorm/% | SportsSOT P/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MixViT | × | × | 68.7 | 78.4 | 74.3 | 70.4 | 79.8 | 67.9 | 66.4 | 73.2 | 69.0 |
| MixViT | √ | × | 69.5 | 79.2 | 74.8 | 71.2 | 81.4 | 68.7 | 66.7 | 73.3 | 69.5 |
| MixViT | × | √ | 69.4 | 79.6 | 75.4 | 70.9 | 82.1 | 67.9 | 67.4 | 74.4 | 72.3 |
| MixViT | √ | √ | 70.2 | 80.0 | 76.5 | 73.7 | 83.6 | 70.4 | 67.8 | 74.9 | 72.7 |
| OSTrack | × | × | 69.0 | 78.6 | 75.1 | 70.9 | 80.2 | 68.2 | 66.1 | 73.7 | 69.4 |
| OSTrack | √ | × | 69.6 | 79.1 | 76.0 | 72.2 | 82.0 | 69.8 | 67.4 | 74.9 | 71.8 |
| OSTrack | × | √ | 70.2 | 79.7 | 76.2 | 72.9 | 81.7 | 70.4 | 67.9 | 75.2 | 72.9 |
| OSTrack-ST | √ | √ | 70.5 | 80.8 | 77.2 | 73.7 | 84.1 | 71.4 | 68.7 | 76.2 | 73.7 |
Tab.4 Results of ablation experiments on MixViT and OSTrack algorithms
Fig.6 Comparison of feature maps of MixViT and OSTrack adopting different improvement strategies
Fig.7 Comparison of tracking results of different algorithms in complex scenarios
[1]   JAVED S, DANELLJAN M, KHAN F S, et al Visual object tracking with discriminative filters and Siamese networks: a survey and outlook[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45 (5): 6552- 6574
[2]   SUN Xunhong, DU Haibo, CHEN Weile, et al Finite-time target tracking control based on mobile robot’s onboard PanTilt-Zoom camera system[J]. Control and Decision, 2023, 38 (10): 2875- 2880
[3]   JIANG Jiahong, XIA Nan, LI Changwu, et al Keypoint detection method for single person gymnastics actions based on multi-scale incremental learning[J]. Acta Electronica Sinica, 2024, 52 (5): 1730- 1742
[4]   MARVASTI-ZADEH S M, CHENG L, GHANEI-YAKHDAN H, et al Deep learning for visual tracking: a comprehensive survey[J]. IEEE Transactions on Intelligent Transportation Systems, 2022, 23 (5): 3943- 3968
doi: 10.1109/TITS.2020.3046478
[5]   LU Huchuan, LI Peixia, WANG Dong Visual object tracking: a survey[J]. Pattern Recognition and Artificial Intelligence, 2018, 31 (1): 61- 76
[6]   DU S, WANG S An overview of correlation-filter-based object tracking[J]. IEEE Transactions on Computational Social Systems, 2022, 9 (1): 18- 31
doi: 10.1109/TCSS.2021.3093298
[7]   ZHANG Jinpu, WANG Yuehuan A survey of Siamese networks tracking algorithm integrating detection technology[J]. Infrared and Laser Engineering, 2022, 51 (10): 1- 14
[8]   LI B, YAN J, WU W, et al. High performance visual tracking with Siamese region proposal network [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 8971–8980.
[9]   XU Y, WANG Z, LI Z, et al. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines [C]// Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI, 2020: 12549–12556.
[10]   CHEN D, TANG F, DONG W, et al SiamCPN: visual tracking with the Siamese center-prediction network[J]. Computational Visual Media, 2021, 7 (2): 253- 265
doi: 10.1007/s41095-021-0212-1
[11]   ZHANG L, GONZALEZ-GARCIA A, VAN DE WEIJER J, et al. Learning the model update for Siamese trackers [C]// IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019: 4009–4018.
[12]   SARIBAS H, CEVIKALP H, KÖPÜKLÜ O, et al TRAT: tracking by attention using spatio-temporal features[J]. Neurocomputing, 2022, 492 (1): 150- 161
[13]   MAYER C, DANELLJAN M, PANI PAUDEL D, et al. Learning target candidate association to keep track of what not to track [C]// IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 13424–13434.
[14]   WANG Mengmeng, YANG Xiaoqian, LIU Yong A spatio-temporal encoded network for single object tracking[J]. Journal of Image and Graphics, 2022, 27 (9): 2733- 2748
doi: 10.11834/jig.211157
[15]   HAN K, WANG Y, CHEN H, et al A survey on vision Transformer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45 (1): 87- 110
doi: 10.1109/TPAMI.2022.3152247
[16]   CHEN X, YAN B, ZHU J, et al. Transformer tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 8122–8131.
[17]   YAN B, PENG H, FU J, et al. Learning spatio-temporal Transformer for visual tracking [C]// IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 10428–10437.
[18]   GAO S, ZHOU C, MA C, et al. AiATrack: attention in attention for Transformer visual tracking [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 146–164.
[19]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[20]   WANG N, ZHOU W, WANG J, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Nashville: IEEE, 2021: 1571–1580.
[21]   LING L T, FAN H, ZHANG Z P, et al. SwinTrack: a simple and strong baseline for Transformer tracking [C]// Conference on Neural Information Processing Systems. New Orleans: [s. n.], 2022: 16743–16754.
[22]   HOU Zhiqiang, YANG Xiaolin, MA Sugang, et al Feature enhancement and history frame selection based Transformer visual tracking[J]. Control and Decision, 2024, 39 (10): 3506- 3512
[23]   CUI Y, JIANG C, WU G, et al MixFormer: end-to-end tracking with iterative mixed attention[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46 (6): 4129- 4146
doi: 10.1109/TPAMI.2024.3349519
[24]   LIU C, ZHAO J, BO C, et al LGTrack: exploiting local and global properties for robust visual tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34 (9): 8161- 8171
doi: 10.1109/TCSVT.2024.3390054
[25]   SHI L, ZHONG B, LIANG Q, et al. Explicit visual prompts for visual object tracking [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 4838–4846.
[26]   YE B, CHANG H, MA B, et al. Joint feature learning and relation modeling for tracking: a one-stream framework [C]// European Conference on Computer Vision. Tel Aviv: Springer, 2022: 341–357.
[27]   WANG Y, DENG L, ZHENG Y, et al Temporal convolutional network with soft thresholding and attention mechanism for machinery prognostics[J]. Journal of Manufacturing Systems, 2021, 60 (1): 512- 526
[28]   FAN H, LIN L T, YANG F, et al. LaSOT: a high-quality benchmark for large-scale single object tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 5369–5378.
[29]   HUANG L, ZHAO X, HUANG K GOT-10k: a large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43 (5): 1562- 1577
doi: 10.1109/TPAMI.2019.2957464
[30]   CUI Y, ZENG C, ZHAO X, et al. SportsMOT: a large multi-object tracking dataset in multiple sports scenes [C]// IEEE/CVF International Conference on Computer Vision. Paris: IEEE, 2023: 9887–9897.
[31]   HE K, CHEN X, XIE S, et al. Masked autoencoders are scalable vision learners [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 15979–15988.