Image captioning based on cross-modal cascaded diffusion model

Qiaohong CHEN, Menghao GUO, Xian FANG, Qi SUN

Tab.5 Performance comparison of different image captioning models on the Microsoft COCO dataset (B@1 and B@4: BLEU-1 and BLEU-4; M: METEOR; R: ROUGE-L; C: CIDEr)

| Model category | Model | B@1 | B@4 | M | R | C |
| --- | --- | --- | --- | --- | --- | --- |
| Autoregressive methods | SCST[35] | — | 34.2 | 26.7 | 55.7 | 114.0 |
| | UpDown[8] | 79.8 | 36.5 | 27.7 | 57.3 | 120.1 |
| | RFNet[36] | 79.1 | 36.5 | 27.7 | 57.3 | 121.9 |
| | GCN-LSTM[37] | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 |
| | ORT[38] | 80.5 | 38.6 | 28.7 | 58.4 | 128.3 |
| | AoANet[12] | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 |
| | M2-Transformer[14] | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 |
| | X-Transformer[13] | 80.9 | 39.7 | 29.5 | 59.1 | 133.8 |
| | RSTNet[39] | 81.1 | 39.3 | 29.4 | 58.8 | 133.3 |
| | BLIP[31] | — | 39.7 | — | — | 133.3 |
| | ConCap[40] | — | 40.5 | 30.9 | — | 133.7 |
| Non-autoregressive methods | MNIC[16] | 75.4 | 30.9 | 27.5 | 55.6 | 108.1 |
| | SATIC[20] | 80.6 | 37.9 | 28.6 | — | 127.2 |
| | Bit-Diffusion[21] | — | 34.7 | — | 58.0 | 115.0 |
| | DiffCap[22] | — | 31.6 | 26.5 | 57.0 | 104.3 |
| | E2E[41] | 79.7 | 36.9 | 27.9 | 58.0 | 122.6 |
| | This study | 81.2 | 39.9 | 29.0 | 58.9 | 133.8 |