Image captioning based on cross-modal cascaded diffusion model
Qiaohong CHEN, Menghao GUO, Xian FANG, Qi SUN

Tab.5 Performance comparison of different image captioning models on the Microsoft COCO dataset
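| Category | Model | B@1 | B@4 | M | R | C |
|---|---|---|---|---|---|---|
| Autoregressive | SCST[35] | - | 34.2 | 26.7 | 55.7 | 114.0 |
| | UpDown[8] | 79.8 | 36.5 | 27.7 | 57.3 | 120.1 |
| | RFNet[36] | 79.1 | 36.5 | 27.7 | 57.3 | 121.9 |
| | GCN-LSTM[37] | 80.5 | 38.2 | 28.5 | 58.3 | 127.6 |
| | ORT[38] | 80.5 | 38.6 | 28.7 | 58.4 | 128.3 |
| | AoANet[12] | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 |
| | M2-Transformer[14] | 80.8 | 39.1 | 29.2 | 58.6 | 131.2 |
| | X-Transformer[13] | 80.9 | 39.7 | 29.5 | 59.1 | 133.8 |
| | RSTNet[39] | 81.1 | 39.3 | 29.4 | 58.8 | 133.3 |
| | BLIP[31] | - | 39.7 | - | - | 133.3 |
| | ConCap[40] | - | 40.5 | 30.9 | - | 133.7 |
| Non-autoregressive | MNIC[16] | 75.4 | 30.9 | 27.5 | 55.6 | 108.1 |
| | SATIC[20] | 80.6 | 37.9 | 28.6 | - | 127.2 |
| | Bit-Diffusion[21] | - | 34.7 | - | 58.0 | 115.0 |
| | DiffCap[22] | - | 31.6 | 26.5 | 57.0 | 104.3 |
| | E2E[41] | 79.7 | 36.9 | 27.9 | 58.0 | 122.6 |
| | This study | 81.2 | 39.9 | 29.0 | 58.9 | 133.8 |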