Image captioning generation based on multiple-view cross-modal feature fusion
Naizhou ZHANG, Yunchao ZHAO, Wei CAO, Xiaojian ZHANG

Tab.1 Comparison with other state-of-the-art models on the MSCOCO test dataset in the single-model setting (%)
Model               BLEU-1  BLEU-4  METEOR  ROUGE-L  CIDEr  SPICE
SCST[6]             -       34.2    26.7    55.7     114.0  -
AoANet[9]           80.2    38.9    29.2    58.8     129.8  22.4
X-Transformer[10]   80.9    39.7    29.5    59.1     132.8  23.4
M2Transformer[11]   80.8    39.1    29.2    58.6     131.2  22.6
GET[12]             81.5    39.5    29.3    58.9     131.6  22.8
RSTNet[13]          81.8    40.1    29.8    59.5     135.6  23.3
DLCT[14]            81.4    39.8    29.5    59.1     133.8  23.0
Xmodal-Ctx[16]      81.5    39.7    30.0    59.5     135.9  23.7
DIFNet[8]           81.7    40.0    29.7    59.4     136.2  23.2
PureT[21]           82.1    40.9    30.2    60.1     138.2  24.2
VRCDA[33]           80.6    37.9    28.4    58.2     123.7  21.8
EVCAP[34]           -       41.5    31.2    -        140.1  24.7
MVCMFAF (ours)      83.2    41.6    30.4    60.5     140.6  24.4