Contrastive learning-based sound source localization-guided audio-visual segmentation model
|
Wenhu HUANG (黄文湖), Xing ZHAO (赵邢), Liang XIE (谢亮), Haoran LIANG (梁浩然), Ronghua LIANG (梁荣华)
|
Tab.1 Performance comparison between SSL2AVS and baseline models

| Method | Image encoder | FPS/(frames·s⁻¹) | S4 $M_{\text{F}}$/% | S4 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% |
|---|---|---|---|---|---|---|
| AVSegFormer-R50 | ResNet-50 | 114.97 | 85.9 | 76.45 | 62.8 | 49.53 |
| SSL2AVS-R50 | ResNet-50 | 96.39 | 86.5 | 76.87 | 63.3 | 52.49 |
| AVSegFormer-R50+ | ResNet-50 | 42.53 | 86.4 | 76.11 | 58.0 | 43.41 |
| SSL2AVS-R50+ | ResNet-50 | 36.33 | 86.8 | 77.16 | 66.9 | 56.18 |
| AVSegFormer-R50* | ResNet-50 | 30.96 | 86.7 | 76.38 | 65.6 | 53.81 |
| SSL2AVS-R50* | ResNet-50 | 26.11 | 88.0 | 78.68 | 69.8 | 59.50 |
| AVSegFormer-PVT | PVT v2 | 81.83 | 89.9 | 82.06 | 69.3 | 58.36 |
| SSL2AVS-PVT | PVT v2 | 80.06 | 90.3 | 82.42 | 72.3 | 62.15 |
| AVSegFormer-PVT* | PVT v2 | 22.79 | 90.5 | 83.06 | 73.0 | 61.33 |
| SSL2AVS-PVT* | PVT v2 | 21.20 | 91.6 | 84.43 | 75.6 | 65.16 |