Journal of ZheJiang University (Engineering Science)  2020, Vol. 54 Issue (1): 126-134    DOI: 10.3785/j.issn.1008-973X.2020.01.015
Computer Technology, Information Engineering     
Image captioning based on global-local feature and adaptive-attention
Xiao-hu ZHAO1,2, Liang-fei YIN1,2, Cheng-long ZHAO1,2
1. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, China University of Mining and Technology, Xuzhou 221008, China
2. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China

Abstract  

An image captioning algorithm was proposed to bridge the gap between the low-level visual features of an image and its high-level semantic concepts. The algorithm can determine the focus of an image, mine higher-level semantic information, and enrich the details of the generated description. Local features were added to the visual feature extraction stage, and the global and local features of the input image were combined into a global-local feature that serves as the visual input, so that the focus of the image at different time steps could be determined and more image details captured. An attention mechanism was added to weight the image features during decoding, so that the dependence of the current output word on visual information versus semantic information could be adjusted adaptively, effectively improving captioning performance. Experimental results show that the proposed method achieves competitive results compared with other image captioning algorithms; it describes images more accurately and comprehensively, and attains higher recognition accuracy on small objects.
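The global-local feature described in the abstract can be pictured with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration rather than the authors' code: it assumes the global branch is a pooled CNN vector (e.g. VGG/ResNet) and the local branch is a stack of Faster R-CNN region vectors, and every module name and dimension (GlobalLocalEncoder, hidden_dim, etc.) is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class GlobalLocalEncoder(nn.Module):
    """Hypothetical sketch of global-local feature fusion.

    Assumptions (not from the paper): the global branch is a CNN feature
    pooled to one vector; the local branch is k Faster R-CNN region features;
    both are projected to a shared dimension and stacked so a downstream
    attention module can select among them.
    """

    def __init__(self, global_dim=2048, region_dim=4096, hidden_dim=512):
        super().__init__()
        self.global_proj = nn.Linear(global_dim, hidden_dim)  # global CNN vector -> hidden
        self.region_proj = nn.Linear(region_dim, hidden_dim)  # per-region vector -> hidden

    def forward(self, global_feat, region_feats):
        # global_feat:  (batch, global_dim), pooled whole-image feature
        # region_feats: (batch, k, region_dim), top-k detected regions
        g = torch.relu(self.global_proj(global_feat)).unsqueeze(1)  # (batch, 1, hidden)
        r = torch.relu(self.region_proj(region_feats))              # (batch, k, hidden)
        # Slot 0 carries the global view; slots 1..k carry local views.
        return torch.cat([g, r], dim=1)                             # (batch, k+1, hidden)
```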



Key words: image captioning; image focus; higher-level semantic information; description detail; global-local feature extraction; adaptive-attention mechanism
Received: 29 April 2019      Published: 05 January 2020
CLC:  TP 391  
Cite this article:

Xiao-hu ZHAO,Liang-fei YIN,Cheng-long ZHAO. Image captioning based on global-local feature and adaptive-attention. Journal of ZheJiang University (Engineering Science), 2020, 54(1): 126-134.

URL:

http://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2020.01.015     OR     http://www.zjujournals.com/eng/Y2020/V54/I1/126


Fig.1 Structure of image captioning based on CNN-LSTM
Fig.2 Faster R-CNN structure diagram
Fig.3 RPN structure diagram
Fig.4 Local feature extraction diagram
Fig.5 R-CNN diagram
Fig.6 Global-local feature extraction diagram
Fig.7 Traditional attention mechanism and adaptive attention mechanism
Fig.8 Detailed structure of adaptive attention mechanism
Fig.9 Schematic diagram of overall model structure
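Figs. 7 and 8 contrast traditional attention with adaptive attention. As a hedged sketch of the adaptive idea (following the visual-sentinel formulation of Lu et al., which these figures appear to build on, not necessarily the paper's exact implementation), the decoder attends jointly over the image features and a learned sentinel vector; the weight that falls on the sentinel measures how much the current word depends on language context rather than visual input. All names and dimensions below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    """Sketch of adaptive attention with a visual sentinel.

    The sentinel s_t (computed from the LSTM, e.g. s_t = g_t * tanh(m_t))
    joins the image features as one more attendable slot, so the model can
    adaptively "look away" from the image for non-visual words like "of".
    """

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.w_v = nn.Linear(hidden_dim, hidden_dim)  # scores image features
        self.w_s = nn.Linear(hidden_dim, hidden_dim)  # scores the sentinel
        self.w_h = nn.Linear(hidden_dim, hidden_dim)  # conditions on decoder state
        self.w_a = nn.Linear(hidden_dim, 1)           # reduces to scalar scores

    def forward(self, feats, sentinel, h):
        # feats:    (batch, k+1, hidden) global-local features from the encoder
        # sentinel: (batch, hidden)      visual sentinel s_t
        # h:        (batch, hidden)      current decoder hidden state h_t
        slots = torch.cat([feats, sentinel.unsqueeze(1)], dim=1)    # (batch, k+2, hidden)
        proj = torch.cat([self.w_v(feats),
                          self.w_s(sentinel).unsqueeze(1)], dim=1)  # (batch, k+2, hidden)
        scores = self.w_a(torch.tanh(proj + self.w_h(h).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)     # last weight = reliance on the sentinel
        context = (alpha.unsqueeze(-1) * slots).sum(dim=1)          # adaptive context
        return context, alpha
```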
Dataset Language Size (images)
Flickr30k English 31 783
MS COCO English 123 000
Tab.1 Overview of the Flickr30k and MS COCO experimental datasets
Method BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
NIC 66.6 46.1 32.9 24.6 – – –
MS Captivator 71.5 54.3 40.7 30.8 24.8 52.6 93.1
m-RNN 67 45 35 25 – – –
LRCN 62.79 44.19 30.41 21 – – –
MSR 73.0 56.5 42.9 32.5 25.1 – 98.6
ATT-EK 74.0 56.0 42.0 31.0 26.0 – –
Soft-attention 70.7 49.2 34.4 24.3 23.9 – –
Hard-attention 71.8 50.4 35.7 25.0 23.0 – –
ATT-FCN 70.9 53.7 40.2 30.4 24.3 – –
Aligning-ATT 69.7 51.9 38.1 28.2 23.5 50.9 83.8
ERD – – – 29.0 23.7 – 88.6
Areas-ATT – – – 30.7 24.5 – 93.8
Proposed method 74.0 60.1 43.9 35.2 27.5 52.4 98.7
Tab.2 Comparison of experimental results on Microsoft COCO caption dataset %
Method BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR
NIC 66.6 42.3 27.7 18.3 –
Soft-attention 66.7 43.4 28.8 19.1 18.49
Hard-attention 44.9 43.9 28.6 19.9 18.46
ATT-FCN 64.7 46.0 32.4 23.0 18.9
Proposed method 68.1 48.1 32.7 25.7 18.9
Tab.3 Comparison of experimental results on Flickr30k caption dataset %
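The columns in Tab.2 and Tab.3 are percentage scores from standard captioning metrics. As a toy illustration of how the BLEU-1 to BLEU-4 columns are computed (published results are normally produced with the COCO caption evaluation toolkit; the captions below are made up for the example), NLTK's corpus_bleu reproduces the n-gram scoring:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis caption is scored against a set of reference captions.
references = [[["a", "man", "rides", "a", "horse"],
               ["a", "person", "riding", "a", "horse"]]]
hypotheses = [["a", "man", "riding", "a", "horse"]]

smooth = SmoothingFunction().method1  # avoids zero scores on short toy data
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))  # uniform weights over 1..n-grams
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.1f}")
```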
Fig.10 Comparison of image captioning results
[1] FARHADI A, HEJRATI M, SADEGHI M A, et al. Every picture tells a story: generating sentences from images [C] // European Conference on Computer Vision. Heraklion: Springer, 2010: 15-29.
[2] MAO J, XU W, YANG Y, et al. Deep captioning with multimodal recurrent neural networks (m-RNN) [EB/OL]. [2014-12-20]. https://arxiv.org/abs/1412.6632.
[3] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator [C] // IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 3156-3164.
[4] WU Q, SHEN C, LIU L, et al. What value do explicit high level concepts have in vision to language problems? [J]. Computer Science, 2016, 12(1): 1640-1649.
[5] ZHOU L, XU C, KOCH P, et al. Watch what you just said: image captioning with text-conditional attention [C] // Proceedings of the Thematic Workshops of ACM Multimedia. [S.l.]: Association for Computing Machinery, 2017: 305-313.
[6] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning [C] // IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017: 1179-1195.
[7] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition [J]. Computer Science, 2014, 32(2): 67-85.
[8] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39(6): 1137-1149.
[9] XU K, BA J, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention [C] // International Conference on Machine Learning. Lille: IMLS, 2015: 2048-2057.
[10] FANG H, GUPTA S, IANDOLA F, et al. From captions to visual concepts and back [C] // IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1473-1482.
[11] DONAHUE J, HENDRICKS L A, ROHRBACH M, et al. Long-term recurrent convolutional networks for visual recognition and description [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 39(4): 677-691.
[12] WU Q, SHEN C, WANG P, et al. Image captioning and visual question answering based on attributes and external knowledge [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(6): 1367-1381.
[13] YAO T, PAN Y, LI Y, et al. Boosting image captioning with attributes [C] // IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 4904-4912.
[14] YOU Q, JIN H, WANG Z, et al. Image captioning with semantic attention [C] // IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 4651-4659.
[15] JIN J, FU K, CUI R, et al. Aligning where to see and what to tell: image caption with region-based attention and scene factorization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39(12): 2321-2334.
[16] YANG Z, YUAN Y, WU Y, et al. Encode, review, and decode: reviewer module for caption generation [C] // International Conference on Neural Information Processing Systems. Barcelona: [s.n.], 2016.