Image captioning generation based on multiple-view cross-modal feature fusion

Naizhou ZHANG, Yunchao ZHAO, Wei CAO, Xiaojian ZHANG

Tab.4 Ablation experimental results on the MSCOCO test dataset (%)

| Module | Model | BLEU-1 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CAMVCMFF module | CAMVCMFF(w/o grid features) | 82.1 | 40.7 | 30.1 | 59.8 | 137.3 | 23.7 |
| CAMVCMFF module | CAMVCMFF(w/o region features) | 80.9 | 40.2 | 29.2 | 59.7 | 132.9 | 22.9 |
| CAMVCMFF module | CAMVCMFF(w/o clip features) | 80.1 | 39.6 | 28.8 | 57.7 | 130.2 | 22.5 |
| CAMVCMFF module | CAMVCMFF(w/o clip-txt) | 80.5 | 39.9 | 29.1 | 58.6 | 131.8 | 22.7 |
| CAMVCMFF module | CAMVCMFF(w/o clip-visual) | 81.7 | 40.7 | 29.9 | 59.7 | 135.3 | 23.8 |
| Swin Encoder | Swin Encoder(w/o global features) | 82.4 | 41.1 | 30.1 | 60.2 | 138.3 | 24.1 |
| Swin Encoder | Swin Encoder(w/ Transformer) | 82.2 | 40.2 | 29.8 | 59.6 | 133.5 | 23.3 |
| - | MVCMFAF (this paper) | 83.2 | 41.6 | 30.4 | 60.5 | 140.6 | 24.4 |