Image captioning generation based on multiple-view cross-modal feature fusion

Naizhou ZHANG, Yunchao ZHAO, Wei CAO, Xiaojian ZHANG
Tab.1 Comparison with other state-of-the-art models on the MSCOCO test dataset in the single-model setting (all scores in %)

| Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- |
| SCST[6] | — | 34.2 | 26.7 | 55.7 | 114.0 | — |
| AoANet[9] | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| X-Transformer[10] | 80.9 | 39.7 | 29.5 | 59.1 | 132.8 | 23.4 |
| M2Transformer[11] | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 | 22.6 |
| GET[12] | 81.5 | 39.5 | 29.3 | 58.9 | 131.6 | 22.8 |
| RSTNet[13] | 81.8 | 40.1 | 29.8 | 59.5 | 135.6 | 23.3 |
| DLCT[14] | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| Xmodal-Ctx[16] | 81.5 | 39.7 | 30.0 | 59.5 | 135.9 | 23.7 |
| DIFNet[8] | 81.7 | 40.0 | 29.7 | 59.4 | 136.2 | 23.2 |
| PureT[21] | 82.1 | 40.9 | 30.2 | 60.1 | 138.2 | 24.2 |
| VRCDA[33] | 80.6 | 37.9 | 28.4 | 58.2 | 123.7 | 21.8 |
| EVCAP[34] | — | 41.5 | 31.2 | — | 140.1 | 24.7 |
| MVCMFAF (ours) | 83.2 | 41.6 | 30.4 | 60.5 | 140.6 | 24.4 |