Journal of Zhejiang University (Engineering Science)  2022, Vol. 56 Issue (2): 236-244    DOI: 10.3785/j.issn.1008-973X.2022.02.003
Computer and Control Engineering
An image caption method for construction scenes based on attention mechanism and encoding-decoding architecture
Yuan-jun NONG, Jun-jie WANG*, Hong CHEN, Wen-han SUN, Hui GENG, Shu-yue LI
School of Engineering, Ocean University of China, Qingdao 266100, China
Full text: PDF (1127 KB)   HTML
Abstract:

An image caption method for construction scenes based on an attention mechanism and an encoding-decoding architecture was proposed, in order to realize image captioning in complex construction scenes such as poor lighting, night construction, and long-distance dense small targets. A convolutional neural network was used to construct the encoder, which extracts rich visual features from construction images. A long short-term memory network was used to construct the decoder, which captures the semantic features of the words within a sentence and learns the mapping between image features and word semantics. An attention mechanism was introduced to focus on salient features, suppress non-salient ones, and reduce the interference of noise. An image caption dataset containing ten common construction scenes was constructed to verify the effectiveness of the proposed method. Experimental results show that the proposed method achieves high accuracy, gives good captioning performance in complex construction scenes such as poor lighting, night construction, and long-distance dense small targets, and exhibits strong generalization and adaptability.
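The abstract describes a three-part pipeline: a CNN encoder, an attention module, and an LSTM decoder. Below is a minimal PyTorch sketch of one decoding step in the style of Show, Attend and Tell [8]; the class names, layer sizes, and the additive form of the attention are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torchvision.models as models

class AdditiveAttention(nn.Module):
    # Bahdanau-style additive attention over the flattened CNN feature grid.
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, feat_dim) spatial features; hidden: (B, hid_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(hidden).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(e, dim=1)                      # salience weights (B, L)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # attended image context
        return context, alpha

class DecoderStep(nn.Module):
    # One LSTM decoding step conditioned on the attended image context.
    def __init__(self, vocab_size, emb_dim=512, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attend = AdditiveAttention(feat_dim, hid_dim, att_dim=512)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, word_ids, feats, state):
        h, c = state
        context, alpha = self.attend(feats, h)
        h, c = self.lstm(torch.cat([self.embed(word_ids), context], dim=1), (h, c))
        return self.out(h), (h, c), alpha

# Encoder: ResNet-101 with the pooling and classification head removed,
# matching the backbone used in Table 2.
backbone = nn.Sequential(*list(models.resnet101(weights=None).children())[:-2])
feats = backbone(torch.randn(1, 3, 224, 224)).flatten(2).transpose(1, 2)  # (1, 49, 2048)

At each step the decoder re-weights the 49 spatial positions of the feature grid, so a word such as "helmet" can be generated while attending to the matching image region, which is how the attention mechanism suppresses non-salient regions, as the abstract describes.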

Key words: image caption    construction scene    attention mechanism    encoding    decoding
Received: 2021-04-09    Published: 2022-03-03
CLC:  TP 391  
Fund: Key Research and Development Program of Shandong Province (2019GHY112081)
Corresponding author: Jun-jie WANG     E-mail: nyj@stu.ouc.edu.cn; wjj@ouc.edu.cn
About the author: NONG Yuan-jun (1995—), male, master's degree, engaged in research on deep learning and disaster prevention and mitigation. orcid.org/0000-0002-0758-469X. E-mail: nyj@stu.ouc.edu.cn

Cite this article:


Yuan-jun NONG, Jun-jie WANG, Hong CHEN, Wen-han SUN, Hui GENG, Shu-yue LI. An image caption method for construction scenes based on attention mechanism and encoding-decoding architecture. Journal of Zhejiang University (Engineering Science), 2022, 56(2): 236-244.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2022.02.003        https://www.zjujournals.com/eng/CN/Y2022/V56/I2/236

Fig. 1  System framework of the construction scene image caption model
Fig. 2  Structure of the LSTM unit
Fig. 3  Visualization of the attention mechanism
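Fig. 2 depicts the LSTM unit used in the decoder. For reference, the standard LSTM gate equations (conventional notation, not reproduced from the paper) are:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

In the captioning decoder, the input x_t is typically the previous word's embedding concatenated with the attended image context.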
Construction scene                        Images    Construction scene                 Images
Worker pushing/pulling a handcart         115       Worker working on a ladder         120
Excavator digging earth                   120       Worker wearing a safety helmet     120
Worker welding iron products              125       Worker laying bricks               130
Worker working with an electric drill    120       Worker tying rebar                 120
Worker working on scaffolding             120       Worker pouring concrete            110
Table 1  Ten common construction scenes and the corresponding number of images
Fig. 4  Examples from the construction scene image caption dataset
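The paper builds its own ten-scene caption dataset (Table 1, Fig. 4) using the VIA annotation tool [12], but the page does not state the storage format. A COCO-style caption layout [9] is a common choice; the sketch below is an assumption, and the file name and caption are hypothetical:

dataset = {
    "images": [
        {"id": 1, "file_name": "welding_001.jpg"}  # hypothetical file name
    ],
    "annotations": [
        {"id": 1, "image_id": 1,
         "caption": "a worker is welding iron products on the construction site"}
    ],
}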
方法 主干网络 BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE_L CIDEr
NIC[17] VGG-16 0.725 0.542 0.386 0.295 0.248 0.531 0.854
Adaptive[18] VGG-16 0.738 0.556 0.403 0.319 0.259 0.545 0.887
Self-critic[19] ResNet-101 0.751 0.573 0.437 0.332 0.266 0.558 0.913
Up-down[20] ResNet-101 0.764 0.587 0.455 0.344 0.271 0.572 0.946
Proposed method ResNet-101 0.783 0.608 0.469 0.357 0.293 0.586 0.962
Table 2  Experimental results of different methods on the construction image caption dataset
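Table 2 reports BLEU-1/2/3/4 [13], METEOR [14], ROUGE_L [15], and CIDEr [16]. The sketch below shows how such scores are commonly computed with the pycocoevalcap package; the paper does not name its evaluation toolkit, so this is an assumption, and the example captions are invented:

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map an image id to a list of caption strings: several human
# references per image, exactly one generated caption per image.
refs = {"img1": ["a worker wearing a safety helmet stands on scaffolding",
                 "a worker in a helmet works on the scaffold"]}
hyps = {"img1": ["a worker with a safety helmet is working on scaffolding"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                     ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(refs, hyps)
    print(name, score)  # Bleu(4) returns [BLEU-1, BLEU-2, BLEU-3, BLEU-4]

Note that CIDEr is a corpus-level consensus metric, so a single-image run is only a smoke test; scores such as those in Table 2 are computed over the whole test split.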
Attention mechanism  BLEU-1  METEOR  ROUGE_L  CIDEr
×                    0.758   0.264   0.562    0.921
√                    0.783   0.293   0.586    0.962
Table 3  Results of the ablation experiment
Fig. 5  Visualization of test results
Fig. 6  Generalization test results
Fig. 7  Visualization of attention mechanism results
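Figs. 3 and 7 overlay attention weights on the input image. A common way to render such maps, sketched here under the assumption of a 7 x 7 attention grid (the paper does not specify its upsampling scheme):

import matplotlib.pyplot as plt
from skimage.transform import resize

def show_attention(image, alpha, grid=7):
    # image: (H, W, 3) array in [0, 1]; alpha: (grid*grid,) attention weights
    heat = resize(alpha.reshape(grid, grid), image.shape[:2], order=3)
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.5)  # translucent heat-map overlay
    plt.axis("off")
    plt.show()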
1 WU J, CAI N, CHEN W, et al. Automatic detection of hardhats worn by construction personnel: a deep learning approach and benchmark dataset[J]. Automation in Construction, 2019, 106: 102894. doi: 10.1016/j.autcon.2019.102894
2 NATH N D, BEHZADAN A H, PAAL S G. Deep learning for site safety: real-time detection of personal protective equipment[J]. Automation in Construction, 2020, 112: 103085. doi: 10.1016/j.autcon.2020.103085
3 GUO Y, XU Y, LI S. Dense construction vehicle detection based on orientation-aware feature fusion convolutional neural network[J]. Automation in Construction, 2020, 112: 103124. doi: 10.1016/j.autcon.2020.103124
4 LI Y, LU Y, CHEN J. A deep learning approach for real-time rebar counting on the construction site based on YOLOv3 detector[J]. Automation in Construction, 2021, 124: 103602. doi: 10.1016/j.autcon.2021.103602
5 XU Shou-kun, NI Chu-han, JI Chen-chen, et al. Research on image caption method based on safety helmet wearing detection[J]. Journal of Chinese Computer Systems, 2020, 41(4): 812-819. doi: 10.3969/j.issn.1000-1220.2020.04.025
6 BANG S, KIM H. Context-based information generation for managing UAV-acquired data using image captioning[J]. Automation in Construction, 2020, 112: 103116. doi: 10.1016/j.autcon.2020.103116
7 LIU H, WANG G, HUANG T, et al. Manifesting construction activity scenes via image captioning[J]. Automation in Construction, 2020, 119: 103334. doi: 10.1016/j.autcon.2020.103334
8 XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]// International Conference on Machine Learning. Cambridge: MIT, 2015: 2048-2057.
9 LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]// European Conference on Computer Vision. Berlin: Springer, 2014: 740-755.
10 HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics[J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899. doi: 10.1613/jair.3994
11 YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions[J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78. doi: 10.1162/tacl_a_00166
12 DUTTA A, ZISSERMAN A. The VIA annotation software for images, audio and video[EB/OL]. (2019-04-24)[2021-04-08]. https://arxiv.org/abs/1904.10699.
13 PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]// Annual Meeting on Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318.
14 BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments[C]// ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg: ACL, 2005: 65-72.
15 LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]// ACL Workshop on Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81.
16 VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2015: 4566-4575.
17 VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2015: 3156-3164.
18 LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning[C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2017: 3242-3250.
19 RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2017: 1179-1195.
20 ANDERSON P, HE X, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2018: 6077-6086.