Journal of Zhejiang University (Engineering Science), 2025, Vol. 59, Issue 9: 1803-1813. DOI: 10.3785/j.issn.1008-973X.2025.09.004
Computer Technology
Contrastive learning-based sound source localization-guided audio-visual segmentation model
Wenhu HUANG, Xing ZHAO*, Liang XIE, Haoran LIANG, Ronghua LIANG
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China
Abstract:

A sound source localization-guided audio-visual segmentation (SSL2AVS) model based on contrastive learning was proposed to address the problem that background noise hinders effective information exchange and object discrimination in audio-visual segmentation (AVS) tasks. A progressive two-stage localization-to-segmentation strategy was adopted, in which visual features were refined through sound source localization to suppress background interference, making the model suitable for audio-visual segmentation in complex scenes. Prior to segmentation, a target localization module was introduced to align the audio and visual modalities via contrastive learning and to generate sound source heatmaps that coarsely localize the sounding objects. A feature enhancement module was introduced to construct a multi-scale feature pyramid network that dynamically weighted and fused shallow spatial detail features with deep semantic features based on the localization results, amplifying the visual features of target objects while suppressing background noise. The synergistic operation of the two modules improved the visual representations of objects and enabled the model to focus on object identification. An auxiliary localization loss function was proposed to optimize the localization results by encouraging the model to focus on the image regions that matched the audio features. Experimental results on the MS3 dataset demonstrated that the model achieved a mean Intersection over Union (mIoU) of 62.15%, surpassing the baseline AVSegFormer model.

Key words: audio-visual segmentation; cross-modal interaction; sound source localization; contrastive learning; feature enhancement
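The localization stage summarized in the abstract lends itself to a compact illustration. The following PyTorch sketch is not the authors' implementation: the module name, the feature dimensions, the cosine-similarity heatmap, and the InfoNCE-style contrastive objective are all assumptions standing in for the paper's target localization module and contrastive loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualAligner(nn.Module):
    """Hypothetical sketch of contrastive audio-visual alignment (not the
    paper's code). Audio and visual features are projected into a shared
    space; cosine similarity yields a per-pixel sound-source heatmap."""

    def __init__(self, audio_dim=128, visual_dim=256, embed_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.visual_proj = nn.Conv2d(visual_dim, embed_dim, kernel_size=1)

    def forward(self, audio, visual):
        # audio: (B, audio_dim); visual: (B, visual_dim, H, W)
        a = F.normalize(self.audio_proj(audio), dim=-1)   # (B, D)
        v = F.normalize(self.visual_proj(visual), dim=1)  # (B, D, H, W)
        # Similarity of every audio clip with every frame in the batch:
        sim = torch.einsum('bd,cdhw->bchw', a, v)         # (B, B, H, W)
        # Matched pairs sit on the diagonal -> per-sample heatmaps:
        heatmap = sim.diagonal(dim1=0, dim2=1).permute(2, 0, 1)  # (B, H, W)
        return heatmap, sim

def infonce_loss(sim, temperature=0.07):
    """InfoNCE-style objective: each clip's own frame (diagonal) should
    outscore mismatched frames; frame-level score = max over pixels."""
    logits = sim.flatten(2).amax(dim=2) / temperature     # (B, B)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)

# Usage sketch: 4 one-second clips paired with 4 frames' feature maps.
aligner = AudioVisualAligner()
heatmap, sim = aligner(torch.randn(4, 128), torch.randn(4, 256, 28, 28))
loss = infonce_loss(sim)  # pulls matched audio-frame pairs together
```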
Received: 2024-12-04. Published: 2025-08-25
CLC: TP 391.4
Funding: National Natural Science Foundation of China (62402441, 62432014, 62176235); Natural Science Foundation of Zhejiang Province (LDT23F0202, LDT23F02021F02)
Corresponding author: Xing ZHAO. E-mail: 211123120094@zjut.edu.cn; xing@zjut.edu.cn
First author: Wenhu HUANG (b. 2001), male, master's student, researching object detection. ORCID: 0009-0000-4908-1487. E-mail: 211123120094@zjut.edu.cn

Cite this article:

Wenhu HUANG, Xing ZHAO, Liang XIE, Haoran LIANG, Ronghua LIANG. Contrastive learning-based sound source localization-guided audio-visual segmentation model [J]. Journal of Zhejiang University (Engineering Science), 2025, 59(9): 1803-1813.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.09.004        https://www.zjujournals.com/eng/CN/Y2025/V59/I9/1803

Fig. 1  Architecture of the contrastive learning-based sound source localization-guided audio-visual segmentation (SSL2AVS) model
Fig. 2  Structure of the feature enhancement module
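Figure 2 itself is not reproduced on this page. As a rough sketch of the idea the abstract describes — using the localization heatmap to dynamically weight the fusion of shallow detail features with deep semantic features — the following hypothetical PyTorch module may help; the gating design and all names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapGuidedFusion(nn.Module):
    """Hypothetical sketch of heatmap-guided feature enhancement: the
    sound-source heatmap drives a per-pixel dynamic weighting between a
    shallow, detail-rich feature and a deep, semantic feature, so that
    sounding regions are emphasized and background is suppressed."""

    def __init__(self, channels=256):
        super().__init__()
        self.weight_pred = nn.Sequential(  # predicts the per-pixel fusion weight
            nn.Conv2d(2 * channels + 1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, shallow, deep, heatmap):
        # shallow: (B, C, H, W); deep: (B, C, h, w); heatmap: (B, H', W')
        size = shallow.shape[-2:]
        deep = F.interpolate(deep, size=size, mode='bilinear', align_corners=False)
        hm = F.interpolate(heatmap.unsqueeze(1), size=size, mode='bilinear',
                           align_corners=False)          # (B, 1, H, W)
        w = self.weight_pred(torch.cat([shallow, deep, hm], dim=1))
        fused = w * shallow + (1.0 - w) * deep           # dynamic weighted fusion
        return fused * (1.0 + hm.clamp(min=0.0))         # boost sounding regions
```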
| Method | Image encoder | FPS/(frame·s⁻¹) | S4 $M_{\text{F}}$/% | S4 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% |
| --- | --- | --- | --- | --- | --- | --- |
| AVSegFormer-R50 | ResNet-50 | 114.97 | 85.9 | 76.45 | 62.8 | 49.53 |
| SSL2AVS-R50 | ResNet-50 | 96.39 | 86.5 | 76.87 | 63.3 | 52.49 |
| AVSegFormer-R50+ | ResNet-50 | 42.53 | 86.4 | 76.11 | 58.0 | 43.41 |
| SSL2AVS-R50+ | ResNet-50 | 36.33 | 86.8 | 77.16 | 66.9 | 56.18 |
| AVSegFormer-R50* | ResNet-50 | 30.96 | 86.7 | 76.38 | 65.6 | 53.81 |
| SSL2AVS-R50* | ResNet-50 | 26.11 | 88.0 | 78.68 | 69.8 | 59.50 |
| AVSegFormer-PVT | PVT v2 | 81.83 | 89.9 | 82.06 | 69.3 | 58.36 |
| SSL2AVS-PVT | PVT v2 | 80.06 | 90.3 | 82.42 | 72.3 | 62.15 |
| AVSegFormer-PVT* | PVT v2 | 22.79 | 90.5 | 83.06 | 73.0 | 61.33 |
| SSL2AVS-PVT* | PVT v2 | 21.20 | 91.6 | 84.43 | 75.6 | 65.16 |

Table 1  Performance comparison between SSL2AVS and baseline models
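Throughout the tables, $M_{\text{J}}$ is the Jaccard index (mIoU) and $M_{\text{F}}$ an F-measure over the predicted masks, following the AVSBench benchmark [9]. A minimal sketch for a single pair of binary masks — the $\beta^2 = 0.3$ weighting is the salient-object-detection convention that AVSBench adopts; treat the exact value as an assumption here:

```python
import numpy as np

def jaccard(pred, gt):
    """M_J: intersection over union (Jaccard index) of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def f_measure(pred, gt, beta2=0.3):
    """M_F: F-score; beta2 = 0.3 weights precision over recall."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0
```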
| Method | Image encoder | S4 $M_{\text{F}}$/% | S4 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% |
| --- | --- | --- | --- | --- | --- |
| AVSBench[9] | ResNet-50 | 84.8 | 72.80 | 57.8 | 47.90 |
| | PVT v2 | 87.9 | 78.70 | 64.5 | 54.00 |
| ECMVAE[23] | ResNet-50 | 86.5 | 76.33 | 60.7 | 48.69 |
| | PVT v2 | 90.1 | 81.74 | 70.8 | 57.84 |
| CATR[19] | ResNet-50 | 86.6 | 74.80 | 65.3 | 52.80 |
| | PVT v2 | 89.6 | 81.40 | 70.0 | 59.00 |
| AVSC[36] | ResNet-50 | 85.2 | 77.02 | 61.5 | 49.58 |
| | PVT v2 | 88.2 | 80.57 | 65.1 | 58.22 |
| AVS-UFE[32] | ResNet-50 | 87.5 | 78.96 | 64.5 | 55.88 |
| | PVT v2 | 90.4 | 83.15 | 70.9 | 61.95 |
| COMBO[10] | ResNet-50 | 90.1 | 81.70 | 66.6 | 54.50 |
| | PVT v2 | 91.9 | 84.70 | 71.2 | 59.20 |
| AVSegFormer[20] | ResNet-50 | 85.9 | 76.45 | 62.8 | 49.53 |
| | PVT v2 | 89.9 | 82.06 | 69.3 | 58.36 |
| SSL2AVS | ResNet-50 | 86.8 | 77.16 | 66.9 | 56.18 |
| | PVT v2 | 90.3 | 82.42 | 72.3 | 62.15 |

Table 2  Performance comparison between SSL2AVS and existing audio-visual segmentation methods
| Max-pooling layer | AVSegFormer S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% | SSL2AVS S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Removed | 76.11 | 86.4 | 43.41 | 58.0 | 77.16 | 86.8 | 56.18 | 66.9 |
| Kept | 76.45 | 85.9 | 49.53 | 62.8 | 76.87 | 86.5 | 52.49 | 63.3 |

Table 3  Effect of the feature-size change caused by removing the max-pooling layer on experimental results
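Table 3 asks whether removing an encoder max-pooling layer, which enlarges all downstream feature maps, changes results. With a torchvision ResNet-50 this is commonly done as below; this is a sketch of one plausible setup, not necessarily the authors':

```python
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)
# Replacing the stem's 3x3 max-pool removes one 2x downsampling, so every
# later stage produces feature maps with twice the spatial resolution.
backbone.maxpool = nn.Identity()
```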
Fig. 3  Comparison of audio-visual segmentation results with and without pooling-layer removal, pre-training, and ACT activation
| Frame size | Method | Scratch ResNet-50 $M_{\text{J}}$/% | $M_{\text{F}}$/% | Scratch PVT v2 $M_{\text{J}}$/% | $M_{\text{F}}$/% | S4-pretrained ResNet-50 $M_{\text{J}}$/% | $M_{\text{F}}$/% | S4-pretrained PVT v2 $M_{\text{J}}$/% | $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 224×224 | AVSBench | 47.88 | 57.8 | 54.00 | 64.5 | 54.33 | 57.34 | — | — |
| | ECMVAE | 48.69 | 60.7 | 57.84 | 70.8 | 57.56 | 67.4 | 60.81 | 72.9 |
| | AuTR | 49.41 | 61.2 | 56.21 | 67.2 | 56.00 | 66.0 | 60.95 | 72.5 |
| | AVS-UFE | 55.88 | 64.5 | 61.95 | 70.9 | 59.32 | 64.47 | — | — |
| | AVSegFormer | 49.53 | 62.8 | 58.36 | 69.3 | 53.73 | 64.3 | 60.92 | 72.0 |
| | SSL2AVS | 52.49 | 63.3 | 62.15 | 72.3 | 59.84 | 69.4 | 64.16 | 74.7 |
| | SSL2AVS+ | 56.18 | 66.9 | 62.49 | 72.5 | — | — | — | — |
| 512×512 | AVSegFormer | 53.81 | 65.6 | 61.33 | 73.0 | 54.49 | 63.7 | 61.37 | 73.1 |
| | SSL2AVS | 59.50 | 69.8 | 65.16 | 75.6 | 64.27 | 74.3 | 68.44 | 77.8 |

Table 4  Performance comparison of different initialization strategies on the MS3 subset
| Activation function | Image encoder | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- |
| ACT | ResNet-50 | 78.68 | 88.0 | 59.50 | 69.8 |
| | PVT v2 | 84.43 | 91.6 | 65.16 | 75.6 |
| Sigmoid | ResNet-50 | 78.19 | 87.8 | 55.40 | 66.1 |
| | PVT v2 | 84.15 | 91.3 | 61.69 | 72.9 |
| None | ResNet-50 | 77.76 | 87.2 | 58.76 | 69.3 |
| | PVT v2 | 84.33 | 91.5 | 63.30 | 74.3 |

Table 5  Effect of ACT activation on experimental results
| Method | Image encoder | $N_{\mathrm{p}}/10^6$ | MS3 10% $M_{\text{J}}$/% | $M_{\text{F}}$/% | MS3 30% $M_{\text{J}}$/% | $M_{\text{F}}$/% | S4 10% $M_{\text{J}}$/% | $M_{\text{F}}$/% | S4 30% $M_{\text{J}}$/% | $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AVSegFormer | ResNet-50 | 126.74 | 44.58 | 56.9 | 47.40 | 60.7 | 70.13 | 84.3 | 74.81 | 85.7 |
| | PVT v2 | 183.95 | 52.75 | 64.8 | 55.16 | 67.9 | 79.68 | 88.8 | 81.55 | 89.7 |
| SSL2AVS | ResNet-50 | 136.45 | 53.90 | 66.0 | 57.76 | 69.2 | 71.80 | 84.3 | 76.67 | 87.1 |
| | PVT v2 | 179.72 | 59.57 | 69.2 | 62.58 | 72.6 | 80.94 | 89.8 | 82.75 | 90.8 |

Table 6  Performance comparison between SSL2AVS and the baseline with limited (10%/30%) training data
| Feature enhancement | Image encoder | $N_{\mathrm{p}}/10^6$ | S4 $M_{\mathrm{J}}$/% | S4 $M_{\mathrm{F}}$/% | MS3 $M_{\mathrm{J}}$/% | MS3 $M_{\mathrm{F}}$/% |
| --- | --- | --- | --- | --- | --- |
| Without | ResNet-50 | 120.75 | 77.59 | 87.5 | 53.37 | 64.0 |
| | PVT v2 | 177.98 | 84.24 | 91.4 | 59.96 | 71.6 |
| With | ResNet-50 | 136.45 | 78.68 | 88.0 | 59.50 | 69.8 |
| | PVT v2 | 179.72 | 84.43 | 91.6 | 65.16 | 75.6 |

Table 7  Effect of the feature enhancement module on model performance
Fig. 4  Comparison of features with and without feature enhancement
| $L_{\text{loc}}$ | $L_{\text{cts}}$ | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | 78.42 | 87.9 | 57.62 | 68.1 |
| ✗ | ✓ | 78.34 | 87.9 | 59.11 | 69.8 |
| ✓ | ✓ | 78.68 | 88.0 | 59.50 | 69.8 |

Table 8  Effect of the contrastive loss and localization loss on model performance
| Fusion scheme | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- |
| No fusion | 77.29 | 86.9 | 55.98 | 66.9 |
| Visual-only fusion | 77.66 | 87.3 | 58.84 | 69.0 |
| Audio-only fusion | 77.84 | 87.2 | 57.57 | 68.6 |
| Bidirectional fusion | 78.68 | 88.0 | 59.50 | 69.8 |

Table 9  Effect of the bidirectional attention fusion module on model performance
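One plausible reading of Table 9's bidirectional fusion is cross-attention run in both directions, with audio tokens attending to visual tokens and vice versa. The sketch below uses nn.MultiheadAttention under that assumption; the paper's actual fusion block is not shown on this page.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Hypothetical bidirectional cross-attention fusion sketch."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens, visual_tokens):
        # audio_tokens: (B, Na, D); visual_tokens: (B, Nv, D)
        # Visual queries attend to audio (audio -> visual direction) ...
        v_out, _ = self.a2v(visual_tokens, audio_tokens, audio_tokens)
        # ... and audio queries attend to visual (visual -> audio direction).
        a_out, _ = self.v2a(audio_tokens, visual_tokens, visual_tokens)
        return audio_tokens + a_out, visual_tokens + v_out
```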
| Number of queries | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- |
| 100 | 77.87 | 87.2 | 58.51 | 68.6 |
| 200 | 77.84 | 87.4 | 58.88 | 69.0 |
| 300 | 78.68 | 88.0 | 59.50 | 69.8 |
| 500 | 77.75 | 87.3 | 58.89 | 69.0 |

Table 10  Effect of the number of queries on model performance
| Learnable embedding | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- |
| Without | 77.43 | 87.3 | 57.47 | 68.0 |
| With | 78.68 | 88.0 | 59.50 | 69.8 |

Table 11  Effect of learnable queries on model performance
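Tables 10 and 11 vary the decoder's query set. In query-based segmentation Transformers such queries are typically a learnable embedding table fed to the decoder; the sketch below follows that assumption, and all sizes are hypothetical except the 300 queries that Table 10 favors.

```python
import torch
import torch.nn as nn

num_queries, d_model = 300, 256      # 300 queries is Table 10's best setting
queries = nn.Embedding(num_queries, d_model)  # the learnable embedding
layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

memory = torch.randn(2, 784, d_model)              # fused audio-visual tokens
q = queries.weight.unsqueeze(0).expand(2, -1, -1)  # (B, num_queries, d_model)
out = decoder(q, memory)                           # per-query object embeddings
```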
| No. | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.00 | 0.00 | 0.00 | 56.67 | 67.4 |
| 1 | 0.00 | 0.00 | 0.05 | 56.86 | 68.1 |
| 2 | 0.00 | 0.00 | 0.10 | 57.62 | 68.1 |
| 3 | 0.00 | 0.00 | 0.50 | 56.88 | 67.1 |
| 4 | 0.00 | 0.05 | 0.10 | 59.07 | 69.5 |
| 5 | 0.00 | 0.10 | 0.10 | 59.11 | 69.8 |
| 6 | 0.00 | 0.50 | 0.10 | 57.72 | 68.2 |
| 7 | 0.01 | 0.10 | 0.10 | 59.50 | 69.8 |
| 8 | 0.05 | 0.10 | 0.10 | 55.43 | 66.6 |
| 9 | 0.10 | 0.10 | 0.10 | 57.55 | 68.1 |

Table 12  Effect of loss coefficients on model performance
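Read together with Table 8, the sweep above is consistent with a training objective of the form $L = {L_{\text{seg}}} + {\lambda _1}{L_{{\text{loc}}}} + {\lambda _2}{L_{{\text{cts}}}} + {\lambda _3}{L_{{\text{aux}}}}$, where $ {L_{{\text{loc}}}} $ is the auxiliary localization loss and $ {L_{{\text{cts}}}} $ the contrastive loss from the abstract; the identity of the third weighted term and the exact mapping of $ {\lambda _1} $–$ {\lambda _3} $ to these losses are inferences from this page, not statements from the paper. The best MS3 setting is No. 7, $({\lambda _1},{\lambda _2},{\lambda _3}) = (0.01, 0.10, 0.10)$.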
Fig. 5  Qualitative comparison of SSL2AVS and AVSegFormer on the S4 and MS3 subsets
1 ARANDJELOVIĆ R, ZISSERMAN A. Look, listen and learn [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 609–617.
2 ARANDJELOVIĆ R, ZISSERMAN A. Objects that sound [C]// Proceedings of the European Conference on Computer Vision. Munich: ECVA, 2018: 451–466.
3 QIAN R, HU D, DINKEL H, et al. Multiple sound sources localization from coarse to fine [C]// Proceedings of the European Conference on Computer Vision. Glasgow: ECVA, 2020: 292–308.
4 MO S, MORGADO P. Localizing visual sounds the easy way [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: ECVA, 2022: 218–234.
5 HU X, CHEN Z, OWENS A. Mix and localize: localizing sound sources in mixtures [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10473–10482.
6 HU D, WEI Y, QIAN R, et al. Class-aware sounding objects localization via audiovisual correspondence [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 9844–9859.
7 MO S, TIAN Y. Audio-visual grouping network for sound localization from mixtures [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 10565–10574.
8 MINAEE S, BOYKOV Y, PORIKLI F, et al. Image segmentation using deep learning: a survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3523–3542.
9 ZHOU J, WANG J, ZHANG J, et al. Audio–visual segmentation [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: ECVA, 2022: 386–403.
10 YANG Q, NIE X, LI T, et al. Cooperation does matter: exploring multi-order bilateral relations for audio-visual segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 27124–27133.
11 LIU J, WANG Y, JU C, et al. Annotation-free audio-visual segmentation [C]// Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2024: 5592–5602.
12 DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2020-10-22) [2025-01-10]. https://arxiv.org/abs/2010.11929.
13 WANG R, TANG D, DUAN N, et al. K-adapter: infusing knowledge into pre-trained models with adapters [EB/OL]. (2020-02-05) [2025-01-10]. https://arxiv.org/abs/2002.01808.
14 KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything [C]// Proceedings of the IEEE International Conference on Computer Vision. Paris: IEEE, 2023: 3992–4003.
15 WANG Y, LIU W, LI G, et al. Prompting segmentation with sound is generalizable audio-visual source localizer [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 5669–5677.
16 MA J, SUN P, WANG Y, et al. Stepping stones: a progressive training strategy for audio-visual semantic segmentation [C]// Proceedings of the European Conference on Computer Vision. Milan: ECVA, 2024: 311–327.
17 CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask Transformer for universal image segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1280–1289.
18 LIU J, JU C, MA C, et al. Audio-aware query-enhanced Transformer for audio-visual segmentation [EB/OL]. (2023-07-25) [2025-01-10]. https://arxiv.org/abs/2307.13236.
19 LI K, YANG Z, CHEN L, et al. CATR: combinatorial-dependence audio-queried Transformer for audio-visual video segmentation [C]// Proceedings of the 31st ACM International Conference on Multimedia. Ottawa: ACM, 2023: 1485–1494.
20 GAO S, CHEN Z, CHEN G, et al. AVSegFormer: audio-visual segmentation with Transformer [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 12155–12163.
21 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st Annual Conference on Neural Information Processing Systems. Long Beach: NeurIPS Foundation, 2017: 6000–6010.
22 XU B, LIANG H, LIANG R, et al. Locate globally, segment locally: a progressive architecture with knowledge review network for salient object detection [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2021: 3004–3012.
23 MAO Y, ZHANG J, XIANG M, et al. Multimodal variational auto-encoder based audio-visual segmentation [C]// Proceedings of the IEEE International Conference on Computer Vision. Paris: IEEE, 2023: 954–965.
24 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
25 WANG W, XIE E, LI X, et al. PVT v2: improved baselines with pyramid vision Transformer [J]. Computational Visual Media, 2022, 8(3): 415–424. doi: 10.1007/s41095-022-0274-8.
26 HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification [C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 131–135.
27 RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation [C]// Proceedings of the Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234–241.
28 WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the European Conference on Computer Vision. Munich: ECVA, 2018: 3–19.
29 ZHAO X, LIANG H, LI P, et al. Motion-aware memory network for fast video salient object detection [J]. IEEE Transactions on Image Processing, 2024, 33: 709–721.
30 WANG Q, WU B, ZHU P, et al. ECA-net: efficient channel attention for deep convolutional neural networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531–11539.
31 MILLETARI F, NAVAB N, AHMADI S A. V-net: fully convolutional neural networks for volumetric medical image segmentation [C]// Fourth International Conference on 3D Vision. Stanford: IEEE, 2016: 565–571.
32 LIU J, LIU Y, ZHANG F, et al. Audio-visual segmentation via unlabeled frame exploitation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 26318–26329.
33 LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization [EB/OL]. (2017-11-14) [2025-01-10]. https://arxiv.org/abs/1711.05101.
34 DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248–255.
35 GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: an ontology and human-labeled dataset for audio events [C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 776–780.