Journal of ZheJiang University (Engineering Science)  2024, Vol. 58 Issue (12): 2427-2437    DOI: 10.3785/j.issn.1008-973X.2024.12.002
Target tracking algorithm based on dynamic position encoding and attention enhancement
Changzhen XIONG, Chuanxi GUO, Cong WANG
Beijing Key Laboratory of Urban Road Transportation Intelligent Control Technology, North China University of Technology, Beijing 100144, China

Abstract  

A method based on dynamic position encoding and multi-domain attention feature enhancement was proposed to fully exploit the positional information between the template and the search region and to improve the representation capability of the fused features. First, a position encoding module with convolutional operations was embedded within the attention module; the position encoding was updated along with the attention computation, improving the utilization of spatial structural information. Next, a multi-domain attention enhancement module was introduced, which sampled the spatial dimension with parallel convolutions of different dilation rates and strides to cope with targets of different sizes and aggregated the channel-attention-enhanced features. Finally, a spatial-domain attention enhancement module was incorporated into the decoder to provide accurate classification and regression features for the prediction head. The proposed algorithm achieved an average overlap (AO) of 73.9% on the GOT-10K dataset, and attained area under the curve (AUC) scores of 82.7%, 69.3%, and 70.9% on the TrackingNet, UAV123, and OTB100 datasets, respectively. Comparisons with state-of-the-art algorithms demonstrated that the tracking model, which integrates dynamic position encoding with channel and spatial attention enhancement, effectively strengthens the information interaction between the template and the search region and improves tracking accuracy.
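The dynamic position encoding described above can be illustrated with a simplified, single-channel sketch in the spirit of convolution-based (conditional) positional encodings [18]: tokens are reshaped to their 2-D grid, passed through a small convolution, and the result is added back to the features each time the attention block runs, so the encoding is re-derived from the current features rather than fixed at the input. All function names, the 3×3 kernel, and the single-channel toy setup below are illustrative assumptions, not the paper's implementation.

```python
def conv3x3(grid, kernel):
    """3x3 'same' convolution with zero padding on a 2-D list-of-lists."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += grid[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out

def dynamic_position_encoding(tokens, h, w, kernel):
    """Reshape a token sequence to its h x w grid, convolve, flatten back.

    Because the encoding is computed from the current features, it is
    updated every time the attention block runs, instead of being a fixed
    table added once at the input.
    """
    grid = [tokens[r * w:(r + 1) * w] for r in range(h)]
    enc = conv3x3(grid, kernel)
    return [tokens[i] + enc[i // w][i % w] for i in range(h * w)]

# Example: a 4x4 grid of single-channel tokens with an identity kernel,
# so the encoding equals the features and each output is doubled.
tokens = [float(i) for i in range(16)]
kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
out = dynamic_position_encoding(tokens, 4, 4, kernel)
```

In a real tracker the convolution would be a learned depthwise layer applied per channel; the point of the sketch is only that the encoding depends on (and moves with) the current feature map.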



Key words: transformer; attention mechanism; object tracking; positional encoding; siamese network
Received: 01 November 2023      Published: 25 November 2024
CLC:  TP 391.4  
Fund: Open Fund of the State Key Laboratory of Vehicle-Road Integrated Intelligent Transportation (2024-A001); National Key Research and Development Program of China (2022YFB4300400).
Cite this article:

Changzhen XIONG, Chuanxi GUO, Cong WANG. Target tracking algorithm based on dynamic position encoding and attention enhancement. Journal of ZheJiang University (Engineering Science), 2024, 58(12): 2427-2437.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.12.002     OR     https://www.zjujournals.com/eng/Y2024/V58/I12/2427


Fig.1 Overall framework of proposed algorithm
Fig.2 Dynamic position encoding module
Fig.3 Self-attention feature enhancement and cross-attention feature fusion module with dynamic positional encoding
Fig.4 Multi-domain attention enhancement module
Fig.5 Decoder module
| Tracker | GOT-10K AO/% | GOT-10K SR0.50/% | GOT-10K SR0.75/% | TrackingNet AUC/% | TrackingNet PNorm/% | TrackingNet P/% | UAV123 AUC/% | UAV123 P/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SiamFC[3] | 34.8 | 35.3 | 9.8 | 57.1 | 66.3 | 53.3 | 48.5 | 69.3 |
| SiamRPN++[21] | 51.7 | 61.6 | 32.5 | 73.3 | 80.0 | 69.4 | 64.2 | 84.0 |
| ATOM[22] | 55.6 | 63.4 | 40.2 | 70.3 | 77.1 | 64.8 | 64.3 | — |
| Ocean[7] | 61.1 | 72.1 | 47.3 | — | — | — | 62.1 | 82.3 |
| DiMP[23] | 61.1 | 71.7 | 49.2 | 74.0 | 80.1 | 68.7 | 65.4 | 85.8 |
| KYS[24] | 63.6 | 75.1 | 51.5 | 74.0 | 80.0 | 68.8 | — | — |
| DTT[25] | 63.4 | 74.9 | 51.4 | 79.6 | 85.0 | 78.9 | — | — |
| PrDiMP[26] | 63.4 | 73.8 | 54.3 | 75.8 | 81.6 | 70.4 | 66.9 | 87.8 |
| TrSiam[9] | 66.0 | 76.6 | 57.1 | 78.1 | 82.9 | 72.7 | — | — |
| TrDiMP[9] | 67.1 | 77.7 | 58.3 | 78.4 | 83.3 | 73.1 | 67.5 | — |
| KeepTrack[13] | 68.3 | 79.3 | 61.0 | 78.1 | 83.5 | 73.8 | 69.7 | — |
| STARK[27] | 68.8 | 78.1 | 64.1 | 82.0 | 86.9 | — | — | — |
| TransT[10] | 72.3 | 82.4 | 68.2 | 81.4 | 86.7 | 80.3 | 69.1 | — |
| CTT[28] | — | — | — | 81.4 | 86.4 | — | — | — |
| TCTrack[12] | — | — | — | — | — | — | 60.4 | 80.0 |
| AiATrack[11] | 69.6 | 77.7 | 63.2 | 82.7 | 87.8 | 80.4 | 69.3 | 90.7 |
| ToMP[14] | 73.5 | 85.6 | 66.5 | 81.5 | 86.4 | 78.9 | 65.9 | 85.2 |
| MixFormer[15] | 73.2 | 83.2 | 70.2 | 82.6 | 87.7 | 81.2 | 68.7 | 89.5 |
| Ours | 73.9 | 83.3 | 68.6 | 82.7 | 87.7 | 80.8 | 69.3 | 90.5 |
Tab.1 Comparison of different algorithms on GOT-10K, TrackingNet and UAV123
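The GOT-10K columns of Tab.1 can be read as follows: AO is the mean per-frame overlap (IoU) between predicted and ground-truth boxes, and SR0.50 / SR0.75 are the fractions of frames whose overlap exceeds 0.5 and 0.75. A minimal sketch of these definitions (helper names and the toy sequence are illustrative, not from the paper or the official toolkit):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def got10k_metrics(preds, gts):
    """Average overlap and success rates at the 0.5 / 0.75 thresholds."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    n = len(overlaps)
    ao = sum(overlaps) / n
    sr50 = sum(o > 0.5 for o in overlaps) / n
    sr75 = sum(o > 0.75 for o in overlaps) / n
    return ao, sr50, sr75

# Toy 3-frame sequence: exact hit, partial overlap, complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (5, 5, 10, 10)]
gts = [(0, 0, 10, 10), (2, 0, 10, 10), (50, 50, 10, 10)]
ao, sr50, sr75 = got10k_metrics(preds, gts)
```

The AUC reported for TrackingNet, UAV123, and OTB100 is the analogous area under the success plot, i.e. SR averaged over overlap thresholds from 0 to 1.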
Fig.6 AUC performance of different attributes on LaSOT dataset
Fig.7 Success rates, accuracy, and execution speeds of different algorithms on OTB100 dataset
Fig.8 Performance of different algorithms in four scenarios
Fig.9 Feature map visualization of feature fusion and feature decoding
| Module | AO/% | SR0.50/% | SR0.75/% |
| --- | --- | --- | --- |
| ToMP[14] | 71.9 | 83.1 | 66.7 |
| ToMP[14]+DPE | 72.5 | 83.4 | 66.8 |
| OSTrack[29] | 73.1 | 82.5 | 70.9 |
| OSTrack[29]+DPE | 73.6 | 82.8 | 70.8 |
Tab.2 Performance of ToMP and OSTrack with DPE embedded on GOT-10K
Fig.10 Feature map visualization of ToMP, OSTrack with DPE embedded
| Base | DPE | MDAE | Decoder: SDA | Decoder: CDA | AO/% | fps/(frame·s⁻¹) |
| --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  | 71.7 | 40.1 |
|  |  |  |  |  | 72.8 | 36.7 |
|  |  |  |  |  | 73.4 | 35.7 |
|  |  |  |  |  | 72.6 | 38.6 |
|  |  |  |  |  | 73.9 | 34.5 |
|  |  |  |  |  | 73.9 | 35.1 |
Tab.3 Results of ablation experiment on GOT-10K
[1]   HAN Ruize, FENG Wei, GUO Qing, et al. Single object tracking research: a survey [J]. Chinese Journal of Computers, 2022, 45(9): 1877-1907. doi: 10.11897/SP.J.1016.2022.01877
[2]   LU Huchuan, LI Peixia, WANG Dong. Visual object tracking: a survey [J]. Pattern Recognition and Artificial Intelligence, 2018, 31(1): 61-76.
[3]   BERTINETTO L, VALMADRE J, HENRIQUES J F, et al. Fully-convolutional siamese networks for object tracking [C]// 14th European Conference on Computer Vision . Amsterdam: Springer, 2016: 850–865.
[4]   CHEN Zhiwang, ZHANG Zhongxin, SONG Juan, et al. Tracking algorithm for siamese network based on target-aware feature selection [J]. Acta Optica Sinica, 2020, 40(9): 110-126.
[5]   CHEN Faling, DING Qinghai, LUO Haibo, et al. Target tracking based on adaptive multilayer convolutional feature decision fusion [J]. Acta Optica Sinica, 2020, 40(23): 175-187.
[6]   LI B, YAN J, WU W, et al. High performance visual tracking with Siamese region proposal network [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New York: IEEE, 2018: 8971–8980.
[7]   ZHANG Z, PENG H, FU J, et al. Ocean: object-aware anchor-free tracking [C]// 16th European Conference on Computer Vision. Glasgow: Springer, 2020: 771–787.
[8]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// 31st Annual Conference on Neural Information Processing Systems . Long Beach: IEEE, 2017: 5998–6010.
[9]   WANG N, ZHOU W, WANG J, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . [s.l.]: IEEE, 2021: 1571–1580.
[10]   CHEN X, YAN B, ZHU J, et al. Transformer tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . [s.l.]: IEEE, 2021: 8126–8135.
[11]   GAO S, ZHOU C, MA C, et al. Aiatrack: attention in attention for transformer visual tracking [C]// 17th European Conference on Computer Vision . Tel Aviv: Springer, 2022: 146–164.
[12]   CAO Z, HUANG Z, PAN L, et al. TCTrack: temporal contexts for aerial tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 14798–14808.
[13]   MAYER C, DANELLJAN M, PAUDEL D P, et al. Learning target candidate association to keep track of what not to track [C]// 18th IEEE/CVF International Conference on Computer Vision . [s.l.]: IEEE, 2021: 13444–1345.
[14]   MAYER C, DANELLJAN M, BHAT G, et al. Transforming model prediction for tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 8731–8740.
[15]   CUI Y, JIANG C, WANG L, et al. Mixformer: end-to-end tracking with iterative mixed attention [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 13608–13618.
[16]   WU Q, YANG T, LIU Z, et al. Dropmae: masked autoencoders with spatial-attention dropout for tracking tasks [C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition . New York: IEEE, 2023: 14561–14571.
[17]   CHEN X, PENG H, WANG D, et al. Seqtrack: sequence to sequence learning for visual object tracking [C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition . New York: IEEE, 2023: 14572–14581.
[18]   CHU X, TIAN Z, ZHANG B, et al. Conditional positional encodings for vision transformers. (2021-02-22)[2023-10-10]. https://www.arxiv.org/abs/2102.10882v2.
[19]   WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// 15th European Conference on Computer Vision . Munich: Springer, 2018: 3–19.
[20]   WANG C, XU H, ZHANG X, et al. Convolutional embedding makes hierarchical vision transformer stronger [C]// 17th European Conference on Computer Vision . Tel Aviv: Springer, 2022: 739–756.
[21]   LI B, WU W, WANG Q, et al. Siamrpn++: evolution of siamese visual tracking with very deep networks [C]// 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 15–20.
[22]   DANELLJAN M, BHAT G, KHAN F S, et al. ATOM: accurate tracking by overlap maximization [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 4660–4669.
[23]   BHAT G, DANELLJAN M, GOOL L V, et al. Learning discriminative model prediction for tracking [C]// IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 6182–6191.
[24]   BHAT G, DANELLJAN M, GOOL L V, et al. Know your surroundings: exploiting scene information for object tracking [C]// 16th European Conference on Computer Vision. Glasgow: Springer, 2020: 205–221.
[25]   YU B, TANG M, ZHENG L, et al. High-performance discriminative tracking with transformers [C]// 18th IEEE/CVF International Conference on Computer Vision . [s.l.]: IEEE, 2021: 9856–9865.
[26]   DANELLJAN M, GOOL L V, TIMOFTE R. Probabilistic regression for visual tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . [s.l.]: IEEE, 2020: 7183–7192.
[27]   YAN B, PENG H, FU J, et al. Learning spatio-temporal transformer for visual tracking [C]// 18th IEEE/CVF International Conference on Computer Vision . [s.l.]: IEEE, 2021: 10448–10457.
[28]   ZHONG M, CHEN F, XU J, et al. Correlation-based transformer tracking [C]// 31st International Conference on Artificial Neural Networks . Bristol: European Neural Networks Soc, 2022: 85–96.
[29]   YE B, CHANG H, MA B, et al. Joint feature learning and relation modeling for tracking: a one-stream framework [C]// 17th European Conference on Computer Vision. Tel Aviv: Springer, 2022: 341–357.