Contrastive learning-based sound source localization-guided audio-visual segmentation model

Wenhu HUANG, Xing ZHAO, Liang XIE, Haoran LIANG, Ronghua LIANG

Tab.1 Performance comparison between SSL2AVS and baseline model

Method	Image encoder	FPS/(frames·s⁻¹)	S4 $ {M_{\text{F}}} $/%	S4 $ {M_{\text{J}}} $/%	MS3 $ {M_{\text{F}}} $/%	MS3 $ {M_{\text{J}}} $/%
AVSegFormer-R50	ResNet-50	114.97	85.9	76.45	62.8	49.53
SSL2AVS-R50	ResNet-50	96.39	86.5	76.87	63.3	52.49
AVSegFormer-R50+	ResNet-50	42.53	86.4	76.11	58.0	43.41
SSL2AVS-R50+	ResNet-50	36.33	86.8	77.16	66.9	56.18
AVSegFormer-R50*	ResNet-50	30.96	86.7	76.38	65.6	53.81
SSL2AVS-R50*	ResNet-50	26.11	88.0	78.68	69.8	59.50
AVSegFormer-PVT	PVT v2	81.83	89.9	82.06	69.3	58.36
SSL2AVS-PVT	PVT v2	80.06	90.3	82.42	72.3	62.15
AVSegFormer-PVT*	PVT v2	22.79	90.5	83.06	73.0	61.33
SSL2AVS-PVT*	PVT v2	21.20	91.6	84.43	75.6	65.16