Journal of Zhejiang University (Engineering Science)  2024, Vol. 58, Issue (12): 2427-2437    DOI: 10.3785/j.issn.1008-973X.2024.12.002
Computer Technology
Target tracking algorithm based on dynamic position encoding and attention enhancement
Changzhen XIONG(),Chuanxi GUO,Cong WANG
Beijing Key Laboratory of Urban Road Transportation Intelligent Control Technology, North China University of Technology, Beijing 100144, China
Abstract:

A method based on dynamic position encoding and multi-domain attention feature enhancement was proposed to fully exploit the positional information between the template and the search region and to improve the representation capability of the fused features. First, a position encoding module built on convolutional operations was embedded inside the attention module, and the encoding was updated along with the attention computation to make better use of the features' own spatial structure. Next, a multi-domain attention enhancement module was introduced: parallel convolutions with different dilation rates and strides sampled the spatial dimension to cope with targets of different sizes, and the channel-attention-enhanced features were aggregated. Finally, a spatial-domain attention enhancement module was incorporated into the decoder to provide more precise classification and regression features for the prediction head. The proposed algorithm achieved an average overlap (AO) of 73.9% on the GOT-10K dataset and area under the curve (AUC) scores of 82.7%, 69.3%, and 70.9% on the TrackingNet, UAV123, and OTB100 datasets, respectively. Comparisons with state-of-the-art algorithms demonstrated that the tracking model, which integrates dynamic position encoding with channel and spatial attention enhancement, effectively strengthens the information interaction between the template and the search region and improves tracking accuracy.
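The dynamic position encoding described above can be sketched compactly. The following NumPy fragment is a minimal illustration only, assuming a 3×3 depthwise convolution over the token grid (in the spirit of conditional position encodings, ref. [18]); the function names, the untrained kernel weights, and the single-head attention without learned projections are hypothetical simplifications, not the authors' implementation.

```python
import numpy as np

def dynamic_position_encoding(tokens, h, w, kernels):
    """Position encoding from a 3x3 depthwise convolution over the token grid.

    tokens: (h*w, c) flattened feature tokens; kernels: (3, 3, c) per-channel
    weights (hypothetical, untrained). The conv output is added back onto the
    tokens, so the encoding is recomputed whenever the tokens change.
    """
    n, c = tokens.shape
    grid = tokens.reshape(h, w, c)
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))   # zero padding keeps h x w
    pe = np.zeros_like(grid)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3, :]        # 3x3 neighbourhood
            pe[i, j] = np.einsum('ijc,ijc->c', patch, kernels)
    return tokens + pe.reshape(n, c)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_dpe(tokens, h, w, kernels):
    """Single-head self-attention whose input re-encodes position first."""
    x = dynamic_position_encoding(tokens, h, w, kernels)  # refreshed per layer
    q = k = v = x                       # sketch: no learned q/k/v projections
    scores = softmax(q @ k.T / np.sqrt(x.shape[1]))
    return scores @ v
```

With an identity kernel (a single 1 at the centre tap per channel) the encoding simply adds each token to itself, which makes the behaviour easy to check; in the published model the kernel weights would be learned end to end.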

Key words: transformer; attention mechanism; object tracking; positional encoding; siamese network
Received: 2023-11-01  Published: 2024-11-25
CLC:  TP 391.4  
Funding: Open Fund of the National Key Laboratory of Vehicle-Road Integrated Intelligent Transportation (2024-A001); National Key Research and Development Program of China (2022YFB4300400).
About the author: XIONG Changzhen (1979—), male, associate professor; research interests: computer vision, deep learning, and video analysis. orcid.org/0000-0001-7645-5181. E-mail: xczkiong@ncut.edu.cn

Cite this article:


Changzhen XIONG, Chuanxi GUO, Cong WANG. Target tracking algorithm based on dynamic position encoding and attention enhancement. Journal of Zhejiang University (Engineering Science), 2024, 58(12): 2427-2437.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.12.002        https://www.zjujournals.com/eng/CN/Y2024/V58/I12/2427

Fig. 1  Overall framework of the proposed algorithm
Fig. 2  Dynamic position encoding module
Fig. 3  Self-attention feature enhancement and cross-attention feature fusion modules with dynamic position encoding
Fig. 4  Multi-domain attention enhancement module
Fig. 5  Decoder module
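As a structural illustration of the enhancement modules sketched in the figures above, the NumPy fragment below combines parallel dilated spatial sampling with simple channel and spatial gates. This is a sketch under stated assumptions, not the published model: the 3×3 dilated taps use untrained averaging weights, the gates are sigmoid squeeze-style approximations (CBAM-like, ref. [19]), and all names and dilation rates are hypothetical.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_branch(grid, rate):
    """3x3 depthwise sampling with dilation `rate` (untrained: taps averaged)."""
    h, w, c = grid.shape
    p = rate
    padded = np.pad(grid, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(grid)
    for i in range(h):
        for j in range(w):
            taps = padded[i:i + 2 * p + 1:p, j:j + 2 * p + 1:p, :]  # 3x3 dilated taps
            out[i, j] = taps.mean(axis=(0, 1))
    return out

def channel_attention(feat):
    """Squeeze-style channel gate: global average pool -> sigmoid weights."""
    return feat * _sigmoid(feat.mean(axis=(0, 1)))

def spatial_attention(feat):
    """Spatial gate (decoder-side): channel-pooled map -> per-location weights."""
    return feat * _sigmoid(feat.mean(axis=2, keepdims=True))

def multi_domain_enhance(grid, rates=(1, 2, 3)):
    """Aggregate parallel dilated branches, then apply the channel gate."""
    fused = sum(dilated_branch(grid, r) for r in rates) / len(rates)
    return channel_attention(fused)
```

Branches with larger dilation rates cover wider neighbourhoods at the same 3×3 cost, which is what lets one module respond to targets of different sizes before the gated aggregation.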
| Tracker | GOT-10K AO/% | GOT-10K SR0.50/% | GOT-10K SR0.75/% | TrackingNet AUC/% | TrackingNet PNorm/% | TrackingNet P/% | UAV123 AUC/% | UAV123 P/% |
|---|---|---|---|---|---|---|---|---|
| SiamFC[3] | 34.8 | 35.3 | 9.8 | 57.1 | 66.3 | 53.3 | 48.5 | 69.3 |
| SiamRPN++[21] | 51.7 | 61.6 | 32.5 | 73.3 | 80.0 | 69.4 | 64.2 | 84.0 |
| ATOM[22] | 55.6 | 63.4 | 40.2 | 70.3 | 77.1 | 64.8 | 64.3 | — |
| Ocean[7] | 61.1 | 72.1 | 47.3 | — | — | — | 62.1 | 82.3 |
| DiMP[23] | 61.1 | 71.7 | 49.2 | 74.0 | 80.1 | 68.7 | 65.4 | 85.8 |
| KYS[24] | 63.6 | 75.1 | 51.5 | 74.0 | 80.0 | 68.8 | — | — |
| DTT[25] | 63.4 | 74.9 | 51.4 | 79.6 | 85.0 | 78.9 | — | — |
| PrDiMP[26] | 63.4 | 73.8 | 54.3 | 75.8 | 81.6 | 70.4 | 66.9 | 87.8 |
| TrSiam[9] | 66.0 | 76.6 | 57.1 | 78.1 | 82.9 | 72.7 | — | — |
| TrDiMP[9] | 67.1 | 77.7 | 58.3 | 78.4 | 83.3 | 73.1 | 67.5 | — |
| KeepTrack[13] | 68.3 | 79.3 | 61.0 | 78.1 | 83.5 | 73.8 | 69.7 | — |
| STARK[27] | 68.8 | 78.1 | 64.1 | 82.0 | 86.9 | — | — | — |
| TransT[10] | 72.3 | 82.4 | 68.2 | 81.4 | 86.7 | 80.3 | 69.1 | — |
| CTT[28] | — | — | — | 81.4 | 86.4 | — | — | — |
| TCTrack[12] | — | — | — | — | — | — | 60.4 | 80.0 |
| AiATrack[11] | 69.6 | 77.7 | 63.2 | 82.7 | 87.8 | 80.4 | 69.3 | 90.7 |
| ToMP[14] | 73.5 | 85.6 | 66.5 | 81.5 | 86.4 | 78.9 | 65.9 | 85.2 |
| MixFormer[15] | 73.2 | 83.2 | 70.2 | 82.6 | 87.7 | 81.2 | 68.7 | 89.5 |
| Ours | 73.9 | 83.3 | 68.6 | 82.7 | 87.7 | 80.8 | 69.3 | 90.5 |

Table 1  Comparison of different algorithms on GOT-10K, TrackingNet, and UAV123
Fig. 6  AUC performance on different attributes of the LaSOT dataset
Fig. 7  Success rate, precision, and running speed of different algorithms on OTB100
Fig. 8  Performance of different algorithms in four scenarios
Fig. 9  Visualization of fused features and decoded feature maps
| Module | AO/% | SR0.50/% | SR0.75/% |
|---|---|---|---|
| ToMP[14] | 71.9 | 83.1 | 66.7 |
| ToMP[14]+DPE | 72.5 | 83.4 | 66.8 |
| OSTrack[29] | 73.1 | 82.5 | 70.9 |
| OSTrack[29]+DPE | 73.6 | 82.8 | 70.8 |

Table 2  Performance on GOT-10K of DPE embedded in ToMP and OSTrack
Fig. 10  Feature map visualization of ToMP and OSTrack with embedded DPE
| Base | DPE | MDAE | Decoder SDA | Decoder CDA | AO/% | fps/(frame·s⁻¹) |
|---|---|---|---|---|---|---|
|  |  |  |  |  | 71.7 | 40.1 |
|  |  |  |  |  | 72.8 | 36.7 |
|  |  |  |  |  | 73.4 | 35.7 |
|  |  |  |  |  | 72.6 | 38.6 |
|  |  |  |  |  | 73.9 | 34.5 |
|  |  |  |  |  | 73.9 | 35.1 |

Table 3  Ablation results on the GOT-10K dataset
1 HAN Ruize, FENG Wei, GUO Qing, et al. Single object tracking research: a survey [J]. Chinese Journal of Computers, 2022, 45(9): 1877-1907. doi: 10.11897/SP.J.1016.2022.01877
2 LU Huchuan, LI Peixia, WANG Dong. Visual object tracking: a survey [J]. Pattern Recognition and Artificial Intelligence, 2018, 31(1): 61-76.
3 BERTINETTO L, VALMADRE J, HENRIQUES J F, et al. Fully-convolutional siamese networks for object tracking [C]// 14th European Conference on Computer Vision . Amsterdam: Springer, 2016: 850–865.
4 CHEN Zhiwang, ZHANG Zhongxin, SONG Juan, et al. Tracking algorithm for siamese network based on target-aware feature selection [J]. Acta Optica Sinica, 2020, 40(9): 110-126.
5 CHEN Faling, DING Qinghai, LUO Haibo, et al. Target tracking based on adaptive multilayer convolutional feature decision fusion [J]. Acta Optica Sinica, 2020, 40(23): 175-187.
6 LI B, YAN J, WU W, et al. High performance visual tracking with Siamese region proposal network [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New York: IEEE, 2018: 8971–8980.
7 ZHANG Z, PENG H, FU J, et al. Ocean: object-aware anchor-free tracking [C]// 16th European Conference on Computer Vision. Glasgow: Springer, 2020: 771–787.
8 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// 31st Annual Conference on Neural Information Processing Systems . Long Beach: IEEE, 2017: 5998–6010.
9 WANG N, ZHOU W, WANG J, et al. Transformer meets tracker: exploiting temporal context for robust visual tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . [s.l.]: IEEE, 2021: 1571–1580.
10 CHEN X, YAN B, ZHU J, et al. Transformer tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . [s.l.]: IEEE, 2021: 8126–8135.
11 GAO S, ZHOU C, MA C, et al. AiATrack: attention in attention for transformer visual tracking [C]// 17th European Conference on Computer Vision. Tel Aviv: Springer, 2022: 146–164.
12 CAO Z, HUANG Z, PAN L, et al. TCTrack: temporal contexts for aerial tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 14798–14808.
13 MAYER C, DANELLJAN M, PAUDEL D P, et al. Learning target candidate association to keep track of what not to track [C]// 18th IEEE/CVF International Conference on Computer Vision. [s.l.]: IEEE, 2021: 13444–13454.
14 MAYER C, DANELLJAN M, BHAT G, et al. Transforming model prediction for tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . New Orleans: IEEE, 2022: 8731–8740.
15 CUI Y, JIANG C, WANG L, et al. MixFormer: end-to-end tracking with iterative mixed attention [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 13608–13618.
16 WU Q, YANG T, LIU Z, et al. DropMAE: masked autoencoders with spatial-attention dropout for tracking tasks [C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2023: 14561–14571.
17 CHEN X, PENG H, WANG D, et al. SeqTrack: sequence to sequence learning for visual object tracking [C]// 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2023: 14572–14581.
18 CHU X, TIAN Z, ZHANG B, et al. Conditional positional encodings for vision transformers [EB/OL]. (2021-02-22)[2023-10-10]. https://www.arxiv.org/abs/2102.10882v2.
19 WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// 15th European Conference on Computer Vision . Munich: Springer, 2018: 3–19.
20 WANG C, XU H, ZHANG X, et al. Convolutional embedding makes hierarchical vision transformer stronger [C]// 17th European Conference on Computer Vision . Tel Aviv: Springer, 2022: 739–756.
21 LI B, WU W, WANG Q, et al. SiamRPN++: evolution of siamese visual tracking with very deep networks [C]// 32nd IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 4282–4291.
22 DANELLJAN M, BHAT G, KHAN F S, et al. ATOM: accurate tracking by overlap maximization [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . Long Beach: IEEE, 2019: 4660–4669.
23 BHAT G, DANELLJAN M, GOOL L V, et al. Learning discriminative model prediction for tracking [C]// IEEE/CVF International Conference on Computer Vision . Seoul: IEEE, 2019: 6182–6191.
24 BHAT G, DANELLJAN M, VAN GOOL L, et al. Know your surroundings: exploiting scene information for object tracking [C]// 16th European Conference on Computer Vision. Glasgow: Springer, 2020: 205–221.
25 YU B, TANG M, ZHENG L, et al. High-performance discriminative tracking with transformers [C]// 18th IEEE/CVF International Conference on Computer Vision . [s.l.]: IEEE, 2021: 9856–9865.
26 DANELLJAN M, GOOL L V, TIMOFTE R. Probabilistic regression for visual tracking [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition . [s.l.]: IEEE, 2020: 7183–7192.
27 YAN B, PENG H, FU J, et al. Learning spatio-temporal transformer for visual tracking [C]// 18th IEEE/CVF International Conference on Computer Vision . [s.l.]: IEEE, 2021: 10448–10457.
28 ZHONG M, CHEN F, XU J, et al. Correlation-based transformer tracking [C]// 31st International Conference on Artificial Neural Networks . Bristol: European Neural Networks Soc, 2022: 85–96.