An image caption method for construction scenes, based on an attention mechanism and an encoding-decoding architecture, was proposed to generate captions for complex construction scenes such as poor lighting, night construction, and long-distance dense small targets. A convolutional neural network was used as the encoder to extract rich visual features from construction images. A long short-term memory network was used as the decoder to capture the semantic features of words in sentences and to learn the mapping between image features and word semantic features. An attention mechanism was introduced to focus on salient features, suppress non-salient features, and reduce the interference of noise. An image caption data set covering ten common construction scenes was constructed to verify the effectiveness of the proposed method. Experimental results show that the proposed method achieves high accuracy, performs well in complex construction scenes such as poor lighting, night construction, and long-distance dense small targets, and has strong generalization and adaptability.
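The encoder-decoder pipeline described above (CNN encoder, soft attention, LSTM decoder) can be illustrated with the minimal PyTorch sketch below. It follows the general "show, attend and tell" pattern of [8]; the ResNet-101 backbone matches Tab.2, but the class names, layer sizes, and training setup are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (not the authors' exact model) of a CNN encoder,
# soft-attention module, and LSTM decoder for image captioning.
# Layer sizes are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """Extract a grid of visual features from a construction image."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet101(weights=None)
        # Drop the average-pooling and classification layers; keep the
        # convolutional feature maps (N, 2048, 7, 7 for 224x224 inputs).
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        feats = self.backbone(images)              # (N, 2048, 7, 7)
        return feats.flatten(2).permute(0, 2, 1)   # (N, 49, 2048)


class SoftAttention(nn.Module):
    """Additive (soft) attention over spatial image features."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (N, L, feat_dim), hidden: (N, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)            # attention weights (N, L, 1)
        context = (alpha * feats).sum(dim=1)       # weighted context (N, feat_dim)
        return context, alpha.squeeze(-1)


class Decoder(nn.Module):
    """LSTM decoder that generates one caption word per time step."""

    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512,
                 hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # captions: (N, T) word indices; teacher forcing during training.
        n, t = captions.shape
        h = feats.new_zeros(n, self.lstm.hidden_size)
        c = feats.new_zeros(n, self.lstm.hidden_size)
        logits = []
        for step in range(t):
            context, _ = self.attention(feats, h)
            x = torch.cat([self.embed(captions[:, step]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.fc(h))
        return torch.stack(logits, dim=1)          # (N, T, vocab_size)


if __name__ == "__main__":
    encoder, decoder = Encoder(), Decoder(vocab_size=1000)
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 1000, (2, 12))
    scores = decoder(encoder(images), captions)
    print(scores.shape)  # torch.Size([2, 12, 1000])
```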
Yuan-jun NONG, Jun-jie WANG, Hong CHEN, Wen-han SUN, Hui GENG, Shu-yue LI. An image caption method of construction scene based on attention mechanism and encoding-decoding architecture. Journal of Zhejiang University (Engineering Science), 2022, 56(2): 236-244.
Fig.1 System framework of construction scene image caption model
Fig.2 LSTM cell structure
Fig.3 Visualization of attention mechanism
| Construction scene | Number of images |
| --- | --- |
| Workers pushing/pulling a handcart | 115 |
| Workers climbing a ladder to work | 120 |
| Excavator digging earth | 120 |
| Workers wearing safety helmets | 120 |
| Workers welding ironwork | 125 |
| Workers laying bricks | 130 |
| Workers working with an electric drill | 120 |
| Workers tying rebar | 120 |
| Workers working on scaffolding | 120 |
| Workers pouring concrete | 110 |

Tab.1 Ten common construction scenarios and corresponding number of images
Fig.4 Example of construction scene image caption data set
| Method | Backbone | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE_L | CIDEr |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NIC[17] | VGG-16 | 0.725 | 0.542 | 0.386 | 0.295 | 0.248 | 0.531 | 0.854 |
| Adaptive[18] | VGG-16 | 0.738 | 0.556 | 0.403 | 0.319 | 0.259 | 0.545 | 0.887 |
| Self-critic[19] | ResNet-101 | 0.751 | 0.573 | 0.437 | 0.332 | 0.266 | 0.558 | 0.913 |
| Up-down[20] | ResNet-101 | 0.764 | 0.587 | 0.455 | 0.344 | 0.271 | 0.572 | 0.946 |
| Proposed method | ResNet-101 | 0.783 | 0.608 | 0.469 | 0.357 | 0.293 | 0.586 | 0.962 |

Tab.2 Experimental results of different methods on the construction scene image caption data set
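The BLEU-n columns above aggregate modified n-gram precision between generated and reference captions [13]. A minimal sketch of how such scores can be computed with NLTK is shown below; the two captions are made-up placeholders, not examples from the paper's data set.

```python
# Minimal sketch: corpus-level BLEU-1..BLEU-4 with NLTK.
# The captions below are illustrative placeholders, not data from the paper.
from nltk.translate.bleu_score import corpus_bleu

references = [
    [["a", "worker", "wearing", "a", "safety", "helmet", "is", "laying", "bricks"]],
]
hypotheses = [
    ["a", "worker", "with", "a", "safety", "helmet", "is", "laying", "bricks"],
]

for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses, weights=weights)
    print(f"BLEU-{n}: {score:.3f}")
```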
| Attention mechanism | BLEU-1 | METEOR | ROUGE_L | CIDEr |
| --- | --- | --- | --- | --- |
| × | 0.758 | 0.264 | 0.562 | 0.921 |
| √ | 0.783 | 0.293 | 0.586 | 0.962 |

Tab.3 Ablation study results
Fig.5 Visualization of detection results
Fig.6 Generalization test results
Fig.7 Visualization of attention mechanism results
[1] WU J, CAI N, CHEN W, et al. Automatic detection of hardhats worn by construction personnel: a deep learning approach and benchmark dataset [J]. Automation in Construction, 2019, 106: 102894. doi: 10.1016/j.autcon.2019.102894
[2] NATH N D, BEHZADAN A H, PAAL S G. Deep learning for site safety: real-time detection of personal protective equipment [J]. Automation in Construction, 2020, 112: 103085. doi: 10.1016/j.autcon.2020.103085
[3] GUO Y, XU Y, LI S. Dense construction vehicle detection based on orientation-aware feature fusion convolutional neural network [J]. Automation in Construction, 2020, 112: 103124. doi: 10.1016/j.autcon.2020.103124
[4] LI Y, LU Y, CHEN J. A deep learning approach for real-time rebar counting on the construction site based on YOLOv3 detector [J]. Automation in Construction, 2021, 124: 103602. doi: 10.1016/j.autcon.2021.103602
[5] XU Shou-kun, NI Chu-han, JI Chen-chen, et al. Research on image caption method based on safety helmet wearing detection [J]. Journal of Chinese Computer Systems, 2020, 41(4): 812-819. (in Chinese) doi: 10.3969/j.issn.1000-1220.2020.04.025
[6] BANG S, KIM H. Context-based information generation for managing UAV-acquired data using image captioning [J]. Automation in Construction, 2020, 112: 103116. doi: 10.1016/j.autcon.2020.103116
[7] LIU H, WANG G, HUANG T, et al. Manifesting construction activity scenes via image captioning [J]. Automation in Construction, 2020, 119: 103334. doi: 10.1016/j.autcon.2020.103334
[8] XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C]// International Conference on Machine Learning. Cambridge: MIT, 2015: 2048-2057.
[9] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context [C]// European Conference on Computer Vision. Berlin: Springer, 2014: 740-755.
[10] HODOSH M, YOUNG P, HOCKENMAIER J. Framing image description as a ranking task: data, models and evaluation metrics [J]. Journal of Artificial Intelligence Research, 2013, 47: 853-899. doi: 10.1613/jair.3994
[11] YOUNG P, LAI A, HODOSH M, et al. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions [J]. Transactions of the Association for Computational Linguistics, 2014, 2: 67-78. doi: 10.1162/tacl_a_00166
[12] DUTTA A, ZISSERMAN A. The VIA annotation software for images, audio and video [EB/OL]. (2019-04-24)[2021-04-08]. https://arxiv.org/abs/1904.10699.
[13] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation [C]// Annual Meeting on Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318.
[14] BANERJEE S, LAVIE A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments [C]// ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg: ACL, 2005: 65-72.
[15] LIN C Y. ROUGE: a package for automatic evaluation of summaries [C]// ACL Workshop on Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81.
[16] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation [C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2015: 4566-4575.
[17] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2015: 3156-3164.
[18] LU J, XIONG C, PARIKH D, et al. Knowing when to look: adaptive attention via a visual sentinel for image captioning [C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2017: 3242-3250.
[19] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C]// IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos: IEEE, 2017: 1179-1195.