Image captioning generation based on multiple-view cross-modal feature fusion
Naizhou ZHANG, Yunchao ZHAO, Wei CAO, Xiaojian ZHANG

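Tab. 4 Ablation results on the MSCOCO test set (all values in %)

| Module | Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|---|---|---|---|---|---|---|---|
| CAMVCMFF module | CAMVCMFF (w/o grid features) | 82.1 | 40.7 | 30.1 | 59.8 | 137.3 | 23.7 |
| CAMVCMFF module | CAMVCMFF (w/o region features) | 80.9 | 40.2 | 29.2 | 59.7 | 132.9 | 22.9 |
| CAMVCMFF module | CAMVCMFF (w/o clip features) | 80.1 | 39.6 | 28.8 | 57.7 | 130.2 | 22.5 |
| CAMVCMFF module | CAMVCMFF (w/o clip-txt) | 80.5 | 39.9 | 29.1 | 58.6 | 131.8 | 22.7 |
| CAMVCMFF module | CAMVCMFF (w/o clip-visual) | 81.7 | 40.7 | 29.9 | 59.7 | 135.3 | 23.8 |
| Swin Encoder | Swin Encoder (w/o global features) | 82.4 | 41.1 | 30.1 | 60.2 | 138.3 | 24.1 |
| Swin Encoder | Swin Encoder (w/ Transformer) | 82.2 | 40.2 | 29.8 | 59.6 | 133.5 | 23.3 |
| Full model | MVCMFAF (ours) | 83.2 | 41.6 | 30.4 | 60.5 | 140.6 | 24.4 |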
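
To make the ablation easier to read, the sketch below computes how far each ablated variant falls below the full model on every metric and ranks the variants by CIDEr drop. The scores are copied from Tab. 4. The names `drops_vs_full` and `rank_by_cider_drop` are hypothetical helpers introduced here for illustration and are not part of the authors' code.

```python
# Scores copied from Tab. 4 (all values in %).
METRICS = ["BLEU-1", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr", "SPICE"]

# Full model row: MVCMFAF (ours).
FULL_MODEL = [83.2, 41.6, 30.4, 60.5, 140.6, 24.4]

ABLATIONS = {
    "CAMVCMFF (w/o grid features)":       [82.1, 40.7, 30.1, 59.8, 137.3, 23.7],
    "CAMVCMFF (w/o region features)":     [80.9, 40.2, 29.2, 59.7, 132.9, 22.9],
    "CAMVCMFF (w/o clip features)":       [80.1, 39.6, 28.8, 57.7, 130.2, 22.5],
    "CAMVCMFF (w/o clip-txt)":            [80.5, 39.9, 29.1, 58.6, 131.8, 22.7],
    "CAMVCMFF (w/o clip-visual)":         [81.7, 40.7, 29.9, 59.7, 135.3, 23.8],
    "Swin Encoder (w/o global features)": [82.4, 41.1, 30.1, 60.2, 138.3, 24.1],
    "Swin Encoder (w/ Transformer)":      [82.2, 40.2, 29.8, 59.6, 133.5, 23.3],
}


def drops_vs_full(scores):
    """Return (full-model score - ablated score) per metric, rounded to 1 decimal."""
    return {m: round(f - s, 1) for m, f, s in zip(METRICS, FULL_MODEL, scores)}


def rank_by_cider_drop():
    """Return (variant, CIDEr drop) pairs, largest drop first."""
    rows = [(name, drops_vs_full(s)["CIDEr"]) for name, s in ABLATIONS.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    for name, drop in rank_by_cider_drop():
        print(f"{name:36s} CIDEr drop: {drop:.1f}")
```

On these numbers, removing the CLIP features costs the most CIDEr (10.4 points), while removing the Swin Encoder global features costs the least (2.3 points).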