Image captioning generation based on multiple-view cross-modal feature fusion
Naizhou ZHANG, Yunchao ZHAO, Wei CAO, Xiaojian ZHANG
Tab. 5 Comparison of computational cost, parameter count, and inference time between the MVCMFAF model and other models
Model             FLOPs/10^9   Np/MB     t/ms
Xmodal-Ctx[16]    127.614      35.439    137.107
DIFNet[8]         137.412      28.395    98.244
PureT[21]         882.301      224.201   238.937
MVCMFAF (ours)    137.461      175.769   446.157