Contrastive learning-based sound source localization-guided audio-visual segmentation model

Wenhu HUANG, Xing ZHAO, Liang XIE, Haoran LIANG, Ronghua LIANG

Tab.4 Performance comparison of different initialization strategies on the MS3 subset

| Frame size | Method | From scratch, ResNet-50: $M_{\text{J}}$/% | From scratch, ResNet-50: $M_{\text{F}}$/% | From scratch, PVT v2: $M_{\text{J}}$/% | From scratch, PVT v2: $M_{\text{F}}$/% | S4 pre-trained, ResNet-50: $M_{\text{J}}$/% | S4 pre-trained, ResNet-50: $M_{\text{F}}$/% | S4 pre-trained, PVT v2: $M_{\text{J}}$/% | S4 pre-trained, PVT v2: $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 224×224 | AVSBench | 47.88 | 57.8 | 54.00 | 64.5 | 54.33 | — | 57.34 | — |
| 224×224 | ECMVAE | 48.69 | 60.7 | 57.84 | 70.8 | 57.56 | 67.4 | 60.81 | 72.9 |
| 224×224 | AuTR | 49.41 | 61.2 | 56.21 | 67.2 | 56.00 | 66.0 | 60.95 | 72.5 |
| 224×224 | AVS-UFE | 55.88 | 64.5 | 61.95 | 70.9 | 59.32 | — | 64.47 | — |
| 224×224 | AVSegFormer | 49.53 | 62.8 | 58.36 | 69.3 | 53.73 | 64.3 | 60.92 | 72.0 |
| 224×224 | SSL2AVS | 52.49 | 63.3 | 62.15 | 72.3 | 59.84 | 69.4 | 64.16 | 74.7 |
| 224×224 | SSL2AVS+ | 56.18 | 66.9 | — | — | 62.49 | 72.5 | — | — |
| 512×512 | AVSegFormer | 53.81 | 65.6 | 61.33 | 73.0 | 54.49 | 63.7 | 61.37 | 73.1 |
| 512×512 | SSL2AVS | 59.50 | 69.8 | 65.16 | 75.6 | 64.27 | 74.3 | 68.44 | 77.8 |