Journal of Zhejiang University (Engineering Science)  2025, Vol. 59, Issue (8): 1634-1643    DOI: 10.3785/j.issn.1008-973X.2025.08.010
Computer Technology, Control Engineering, and Communication Technology
Scalable traffic image auto-annotation method based on contrastive learning
Yue HOU(),Qianhui LI,Peng YUAN,Xin ZHANG,Tiantian WANG,Ziwei HAO
School of Electronics and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Abstract:

An expandable automatic annotation method for traffic images based on cross-modal contrastive learning was proposed to address the non-scalable annotation categories and low accuracy of existing automatic annotation methods for traffic images. Dual-modal data comprising text and images were taken as the research object, the similarity relationships between modal features were captured through contrastive learning, and an inter-modal feature enhancement strategy was employed to optimize the effective alignment of cross-modal data. In the text feature extraction stage, a text-distance fusion encoding module was proposed, which enhanced the local feature representation capability of text sequences by constructing a distance-aware feature fusion component. In the image feature extraction stage, a deformable filtering convolution structure was designed, which enhanced the recognition of irregular objects while effectively filtering out noise. A combined contrastive loss function was established to improve the original loss structure and increase the discrimination between positive and negative cross-modal samples. The experimental results show that, compared with other models of similar scale, the proposed model improves mAP0.5 and mAP0.5:0.95 on the BIT vehicle dataset by 5.3% and 4.8%, respectively, exhibiting superior performance in the automatic annotation of traffic images.
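To make the cross-modal alignment objective concrete, below is a minimal PyTorch sketch of a symmetric image-text contrastive (InfoNCE-style) loss of the kind that CLIP-style methods [15] build on. It is an illustrative sketch only: the names contrastive_loss and temperature are assumptions for this example, not the authors' combined loss implementation.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # image_emb, text_emb: (N, D) embeddings of N matched image-text pairs.
    # L2-normalize so that dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (N, N) similarity matrix; diagonal entries are the positive pairs,
    # off-diagonal entries act as in-batch negatives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Example: 8 matched pairs with 512-dimensional embeddings.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))

Treating both directions symmetrically penalizes an image matching the wrong label text as well as a label text matching the wrong image, which is the positive/negative cross-modal discrimination that the combined loss described above strengthens.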

Key words: image annotation; contrastive learning; dual-modality; category expandability; traffic video image
Received: 2024-08-19    Published: 2025-07-28
CLC number: TP 391
Supported by: the National Natural Science Foundation of China (62063014, 62363020) and the Natural Science Foundation of Gansu Province (22JR5RA365).
About the author: HOU Yue (1979-), female, professor, engaged in research on big data and intelligent transportation. orcid.org/0000-0002-8289-329X. E-mail: houyue@mail.lzjtu.cn

Cite this article:

Yue HOU, Qianhui LI, Peng YUAN, Xin ZHANG, Tiantian WANG, Ziwei HAO. Scalable traffic image auto-annotation method based on contrastive learning [J]. Journal of Zhejiang University (Engineering Science), 2025, 59(8): 1634-1643.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.08.010        https://www.zjujournals.com/eng/CN/Y2025/V59/I8/1634

Fig. 1  Framework of the SIAM-CML model
Fig. 2  Enhanced label semantic descriptions
Fig. 3  Text-distance fusion encoding module
Fig. 4  Feature fusion component
Fig. 5  Deformable visual encoding module
Fig. 6  Single filtering-pooling convolution block
Fig. 7  Region images cropped by different methods in the early stage of training
Model            | bus  | microbus | minivan | sedan | suv  | truck | mAP0.5/% | mAP0.5:0.95/%
Yolov4 [20]      | 94.1 | 93.6     | 90.3    | 92.3  | 90.5 | 93.2  | 92.3     | 76.7
SSD [12]         | 95.3 | 94.7     | 91.6    | 93.2  | 92.1 | 96.1  | 93.8     | 77.1
DETR [21]        | 96.8 | 96.3     | 93.7    | 95.5  | 94.4 | 96.5  | 95.6     | 79.6
Faster RCNN [5]  | 97.0 | 96.5     | 94.2    | 95.8  | 94.1 | 96.5  | 95.7     | 79.9
Cascade RCNN [7] | 97.8 | 97.2     | 94.3    | 96.7  | 94.9 | 98.1  | 96.5     | 80.4
Yolov5s [22]     | 96.3 | 95.7     | 92.8    | 94.6  | 93.9 | 96.7  | 95.0     | 78.5
Yolov8s [23]     | 98.3 | 97.8     | 95.1    | 97.5  | 95.7 | 98.2  | 97.1     | 81.2
SIAM-CML         | 99.1 | 98.7     | 94.8    | 98.9  | 95.2 | 98.9  | 97.6     | 81.5
Table 1  Experimental results of different algorithms on the BIT traffic dataset (per-class columns give AP0.5/%)
Model        | car  | bus  | van  | others | mAP0.5/% | mAP0.5:0.95/%
Yolov4       | 82.1 | 80.3 | 71.5 | 69.7   | 75.9     | 54.5
SSD          | 85.3 | 84.7 | 73.1 | 75.7   | 79.7     | 58.8
DETR         | 88.4 | 87.6 | 74.2 | 82.9   | 83.3     | 62.5
Faster RCNN  | 88.7 | 87.9 | 74.3 | 83.1   | 83.5     | 62.9
Cascade RCNN | 90.3 | 89.7 | 75.1 | 83.3   | 84.6     | 63.7
Yolov5s      | 88.1 | 87.3 | 74.2 | 82.4   | 83.0     | 61.6
Yolov8s      | 93.4 | 92.6 | 75.5 | 87.3   | 87.2     | 67.5
SIAM-CML     | 92.5 | 91.7 | 74.8 | 86.6   | 86.4     | 66.9
Table 2  Experimental results of different algorithms on the UA-DETRAC dataset (per-class columns give AP0.5/%)
Fig. 8  Visualization of annotation results on the UA-DETRAC dataset
Variant | mAP0.5/%    | mAP0.5:0.95/%
(a)     | 92.1        | 76.5
(b)     | 92.6 (+0.5) | 77.2
(c)     | 95.2 (+3.1) | 78.6
(d)     | 95.7 (+3.6) | 79.3
(e)     | 96.8 (+4.7) | 80.8
(f)     | 97.6 (+5.5) | 81.5
(g)     | 97.0 (+4.9) | 80.9
Table 3  Ablation experiment results for SIAM-CML variants (a)-(g), which combine text encoding components (Tinybert, MLP, TDFC), visual encoding components (DCN, FP), and contrastive learning components (2*CLS, LL, NL)
Fig. 9  Loss of the incremental learning experiment
Category | mAP0.5/%
car      | 95.0
bus      | 93.6
van      | 86.9
others   | 85.4
suv      | 83.5
truck    | 87.7
Overall  | 88.7
Table 4  Incremental learning experimental results of the SIAM-CML model
Fig. 10  Visualization of contrastive learning classification
1 MA Yanchun, LIU Yongjian, XIE Qing, et al. Review of automatic image annotation technology [J]. Journal of Computer Research and Development, 2020, 57(11): 2348-2374. (in Chinese)
2 SHI Xianjin, CAO Shuang, ZHANG Chongsheng, et al. Research on automatic annotation algorithm for character-level oracle-bone images based on anchor points [J]. Acta Electronica Sinica, 2021, 49(10): 2020-2031. (in Chinese)
3 GIRSHICK R, DONAHUE J, DARRELL T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus: IEEE, 2014: 580-587.
4 GIRSHICK R. Fast R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1440-1448.
5 REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
6 HE K, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
7 CAI Z, VASCONCELOS N. Cascade R-CNN: delving into high quality object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6154-6162.
8 REDMON J, DIVVALA S, GIRSHICK R, et al. You only look once: unified, real-time object detection [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2016: 779-788.
9 REDMON J, FARHADI A. YOLO9000: better, faster, stronger [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 7263-7271.
10 REDMON J, FARHADI A. YOLOv3: an incremental improvement [EB/OL]. (2018-04-08) [2024-08-15]. https://arxiv.org/abs/1804.02767.
11 WANG C Y, BOCHKOVSKIY A, LIAO H Y M. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 7464-7475.
12 LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector [C]// 14th European Conference on Computer Vision. Amsterdam: Springer, 2016: 21-37.
13 XIE Yu, LI Yujun, DONG Wensheng. Automatic image annotation and applied research based on SSD deep neural network [J]. Information Technology and Standardization, 2020(4): 38-42. (in Chinese)
14 QIAO Renjie, CAI Chengtao. Research on FastSAM multi-point annotation algorithm for fisheye images [J]. Journal of Harbin Engineering University, 2024, 45(8): 1427-1433. (in Chinese)
15 RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [C]// International Conference on Machine Learning. Vienna: PMLR, 2021: 8748-8763.
16 ZHONG Y, YANG J, ZHANG P, et al. Regionclip: region-based language-image pretraining [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 16793-16803.
17 YANG J, LI C, ZHANG P, et al. Unified contrastive learning in image-text-label space [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 19163-19173.
18 JIAO X, YIN Y, SHANG L, et al. TinyBERT: distilling BERT for natural language understanding [EB/OL]. (2020-10-16) [2024-08-15]. https://arxiv.org/abs/1909.10351.
19 DONG Z, WU Y, PEI M, et al. Vehicle type classification using a semisupervised convolutional neural network [J]. IEEE Transactions on Intelligent Transportation Systems, 2015, 16(4): 2247-2256. doi: 10.1109/TITS.2015.2402438.
20 BOCHKOVSKIY A, WANG C Y, LIAO H Y M. YOLOv4: optimal speed and accuracy of object detection [EB/OL]. (2020-04-23) [2024-08-15]. https://arxiv.org/abs/2004.10934.
21 CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with transformers [C]//European Conference on Computer Vision. Cham: Springer, 2020: 213-229.
22 ULTRALYTICS. YOLOv5 [EB/OL]. (2021-04-15)[2024-08-15]. https://github.com/ultralytics/yolov5.