Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (9): 1803-1813    DOI: 10.3785/j.issn.1008-973X.2025.09.004
    
Contrastive learning-based sound source localization-guided audio-visual segmentation model
Wenhu HUANG, Xing ZHAO*, Liang XIE, Haoran LIANG, Ronghua LIANG
College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Abstract  

A sound source localization-guided audio-visual segmentation (SSL2AVS) model based on contrastive learning was proposed to address the problem that background noise hindered effective information exchange and object discrimination in audio-visual segmentation (AVS) tasks. A two-stage localization-to-segmentation progressive strategy was adopted, where visual features were refined through sound source localization to suppress background interference, making the model suitable for audio-visual segmentation in complex scenes. Prior to segmentation, a target localization module was introduced to align audio-visual modalities via the contrastive learning method, generating sound source heatmaps to achieve preliminary sound source localization. A multi-scale feature pyramid network incorporating a feature enhancement module was constructed to dynamically weight and fuse the shallow spatial detail features and the deep semantic features based on the localization results, effectively amplifying the visual features of target objects while suppressing background noise. The synergistic operation of the two modules improved visual representations of objects and enabled the model to focus on object identification. An auxiliary localization loss function was proposed to optimize localization results by encouraging the model to focus on the image regions that matched audio features. Experimental results on the MS3 dataset demonstrated that the model achieved a mean Intersection over Union (mIoU) of 62.15, surpassing the baseline AVSegFormer model.
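To make the localization-to-segmentation pipeline concrete, the following minimal PyTorch-style sketch shows how a coarse sound-source heatmap, obtained by aligning an audio embedding with a visual feature map, can be used to re-weight shallow spatial-detail features against deep semantic features. The function names, tensor shapes, and the gating formula are illustrative assumptions for exposition only, not the authors' implementation; in SSL2AVS the enhanced features feed a Transformer-based decoder as in the AVSegFormer baseline.

```python
import torch
import torch.nn.functional as F

def sound_source_heatmap(vis_feat: torch.Tensor, aud_feat: torch.Tensor) -> torch.Tensor:
    """Coarse sound-source localization (sketch): cosine similarity between an
    audio embedding and every spatial position of a visual feature map.
    vis_feat: (B, C, H, W), aud_feat: (B, C); returns a (B, 1, H, W) heatmap."""
    v = F.normalize(vis_feat, dim=1)              # unit-norm along channels
    a = F.normalize(aud_feat, dim=1)              # unit-norm audio embedding
    sim = torch.einsum('bchw,bc->bhw', v, a)      # per-pixel cosine similarity
    return sim.clamp(min=0).unsqueeze(1)          # keep positive responses only

def enhance_features(shallow: torch.Tensor, deep: torch.Tensor,
                     heatmap: torch.Tensor) -> torch.Tensor:
    """Localization-guided feature enhancement (illustrative gating): the heatmap
    amplifies shallow detail features, which are then fused with upsampled deep
    semantic features, boosting the sounding object and damping background.
    Assumes shallow and deep share the same channel count."""
    size = shallow.shape[-2:]
    hm = F.interpolate(heatmap, size=size, mode='bilinear', align_corners=False)
    deep_up = F.interpolate(deep, size=size, mode='bilinear', align_corners=False)
    return shallow * (1.0 + hm) + deep_up

if __name__ == '__main__':
    vis = torch.randn(2, 256, 56, 56)    # shallow visual features
    deep = torch.randn(2, 256, 14, 14)   # deep visual features
    aud = torch.randn(2, 256)            # audio embedding (e.g. VGGish-based)
    hm = sound_source_heatmap(vis, aud)
    fused = enhance_features(vis, deep, hm)
    print(hm.shape, fused.shape)         # (2, 1, 56, 56) and (2, 256, 56, 56)
```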



Key words: audio-visual segmentation; cross-modal interaction; sound source localization; contrastive learning; feature enhancement
Received: 04 December 2024      Published: 25 August 2025
CLC:  TP 391.4  
Fund: National Natural Science Foundation of China (62402441, 62432014, 62176235); Natural Science Foundation of Zhejiang Province (LDT23F0202, LDT23F02021F02).
Corresponding Author: Xing ZHAO     E-mail: 211123120094@zjut.edu.cn; xing@zjut.edu.cn
Cite this article:

Wenhu HUANG,Xing ZHAO,Liang XIE,Haoran LIANG,Ronghua LIANG. Contrastive learning-based sound source localization-guided audio-visual segmentation model. Journal of ZheJiang University (Engineering Science), 2025, 59(9): 1803-1813.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.09.004     OR     https://www.zjujournals.com/eng/Y2025/V59/I9/1803


Fig.1 Structure of contrastive learning-based SSL-guided AVS model
Fig.2 Structure of feature enhancement module
Method | Image encoder | FPS/(frame·s⁻¹) | S4 $M_{\text{F}}$/% | S4 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/%
AVSegFormer-R50 | ResNet-50 | 114.97 | 85.9 | 76.45 | 62.8 | 49.53
SSL2AVS-R50 | ResNet-50 | 96.39 | 86.5 | 76.87 | 63.3 | 52.49
AVSegFormer-R50+ | ResNet-50 | 42.53 | 86.4 | 76.11 | 58.0 | 43.41
SSL2AVS-R50+ | ResNet-50 | 36.33 | 86.8 | 77.16 | 66.9 | 56.18
AVSegFormer-R50* | ResNet-50 | 30.96 | 86.7 | 76.38 | 65.6 | 53.81
SSL2AVS-R50* | ResNet-50 | 26.11 | 88.0 | 78.68 | 69.8 | 59.50
AVSegFormer-PVT | PVT v2 | 81.83 | 89.9 | 82.06 | 69.3 | 58.36
SSL2AVS-PVT | PVT v2 | 80.06 | 90.3 | 82.42 | 72.3 | 62.15
AVSegFormer-PVT* | PVT v2 | 22.79 | 90.5 | 83.06 | 73.0 | 61.33
SSL2AVS-PVT* | PVT v2 | 21.20 | 91.6 | 84.43 | 75.6 | 65.16
Tab.1 Performance comparison between SSL2AVS and baseline model
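The tables report the region similarity $M_{\text{J}}$ (Jaccard index, i.e. the mIoU quoted in the abstract) and the F-score $M_{\text{F}}$. A minimal sketch of how these two quantities are commonly computed on binary masks is given below; the $\beta^2 = 0.3$ weighting follows the usual convention of AVS benchmarks and is stated here as an assumption rather than taken from this paper.

```python
import torch

def jaccard_index(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Jaccard index (IoU) between binary masks; averaging over frames gives M_J."""
    pred, gt = pred.bool(), gt.bool()
    inter = (pred & gt).float().sum()
    union = (pred | gt).float().sum()
    return (inter + eps) / (union + eps)

def f_score(pred: torch.Tensor, gt: torch.Tensor,
            beta2: float = 0.3, eps: float = 1e-7) -> torch.Tensor:
    """F-measure combining precision and recall; beta^2 = 0.3 is the weighting
    commonly used by segmentation benchmarks (assumed here)."""
    pred, gt = pred.bool(), gt.bool()
    tp = (pred & gt).float().sum()
    precision = tp / (pred.float().sum() + eps)
    recall = tp / (gt.float().sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```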
Method | Image encoder | S4 $M_{\text{F}}$/% | S4 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/%
AVSBench[9] | ResNet-50 | 84.8 | 72.80 | 57.8 | 47.90
AVSBench[9] | PVT v2 | 87.9 | 78.70 | 64.5 | 54.00
ECMVAE[23] | ResNet-50 | 86.5 | 76.33 | 60.7 | 48.69
ECMVAE[23] | PVT v2 | 90.1 | 81.74 | 70.8 | 57.84
CATR[19] | ResNet-50 | 86.6 | 74.80 | 65.3 | 52.80
CATR[19] | PVT v2 | 89.6 | 81.40 | 70.0 | 59.00
AVSC[36] | ResNet-50 | 85.2 | 77.02 | 61.5 | 49.58
AVSC[36] | PVT v2 | 88.2 | 80.57 | 65.1 | 58.22
AVS-UFE[32] | ResNet-50 | 87.5 | 78.96 | 64.5 | 55.88
AVS-UFE[32] | PVT v2 | 90.4 | 83.15 | 70.9 | 61.95
COMBO[10] | ResNet-50 | 90.1 | 81.70 | 66.6 | 54.50
COMBO[10] | PVT v2 | 91.9 | 84.70 | 71.2 | 59.20
AVSegFormer[20] | ResNet-50 | 85.9 | 76.45 | 62.8 | 49.53
AVSegFormer[20] | PVT v2 | 89.9 | 82.06 | 69.3 | 58.36
SSL2AVS | ResNet-50 | 86.8 | 77.16 | 66.9 | 56.18
SSL2AVS | PVT v2 | 90.3 | 82.42 | 72.3 | 62.15
Tab.2 Performance comparison of SSL2AVS and existing AVS methods
Max pooling layer | AVSegFormer S4 ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | AVSegFormer MS3 ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | SSL2AVS S4 ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | SSL2AVS MS3 ($M_{\text{J}}$/%, $M_{\text{F}}$/%)
Removed | 76.11, 86.4 | 43.41, 58.0 | 77.16, 86.8 | 56.18, 66.9
Kept | 76.45, 85.9 | 49.53, 62.8 | 76.87, 86.5 | 52.49, 63.3
Tab.3 Effect of feature size change caused by removal of max pooling layer on experimental results
Fig.3 Comparison of AVS results with and without pool layer, pretraining and ACT activation
Frame size | Method | From scratch, ResNet-50 ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | From scratch, PVT v2 ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | S4-pretrained, ResNet-50 ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | S4-pretrained, PVT v2 ($M_{\text{J}}$/%, $M_{\text{F}}$/%)
224×224 | AVSBench | 47.88, 57.8 | 54.00, 64.5 | 54.33, 57.34 | —
224×224 | ECMVAE | 48.69, 60.7 | 57.84, 70.8 | 57.56, 67.4 | 60.81, 72.9
224×224 | AuTR | 49.41, 61.2 | 56.21, 67.2 | 56.00, 66.0 | 60.95, 72.5
224×224 | AVS-UFE | 55.88, 64.5 | 61.95, 70.9 | 59.32, 64.47 | —
224×224 | AVSegFormer | 49.53, 62.8 | 58.36, 69.3 | 53.73, 64.3 | 60.92, 72.0
224×224 | SSL2AVS | 52.49, 63.3 | 62.15, 72.3 | 59.84, 69.4 | 64.16, 74.7
224×224 | SSL2AVS+ | 56.18, 66.9 | 62.49, 72.5 | — | —
512×512 | AVSegFormer | 53.81, 65.6 | 61.33, 73.0 | 54.49, 63.7 | 61.37, 73.1
512×512 | SSL2AVS | 59.50, 69.8 | 65.16, 75.6 | 64.27, 74.3 | 68.44, 77.8
Tab.4 Performance comparison of different initialization strategies on MS3 sub-dataset
Activation function | Image encoder | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/%
ACT | ResNet-50 | 78.68 | 88.0 | 59.50 | 69.8
ACT | PVT v2 | 84.43 | 91.6 | 65.16 | 75.6
Sigmoid | ResNet-50 | 78.19 | 87.8 | 55.40 | 66.1
Sigmoid | PVT v2 | 84.15 | 91.3 | 61.69 | 72.9
— | ResNet-50 | 77.76 | 87.2 | 58.76 | 69.3
— | PVT v2 | 84.33 | 91.5 | 63.30 | 74.3
Tab.5 Effect of ACT activation on experimental results
Method | Image encoder | $N_{\mathrm{p}}/10^6$ | MS3, 10% data ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | MS3, 30% data ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | S4, 10% data ($M_{\text{J}}$/%, $M_{\text{F}}$/%) | S4, 30% data ($M_{\text{J}}$/%, $M_{\text{F}}$/%)
AVSegFormer | ResNet-50 | 126.74 | 44.58, 56.9 | 47.40, 60.7 | 70.13, 84.3 | 74.81, 85.7
AVSegFormer | PVT v2 | 183.95 | 52.75, 64.8 | 55.16, 67.9 | 79.68, 88.8 | 81.55, 89.7
SSL2AVS | ResNet-50 | 136.45 | 53.90, 66.0 | 57.76, 69.2 | 71.80, 84.3 | 76.67, 87.1
SSL2AVS | PVT v2 | 179.72 | 59.57, 69.2 | 62.58, 72.6 | 80.94, 89.8 | 82.75, 90.8
Tab.6 Performance comparison of SSL2AVS and baseline model using a small amount of training data
Feature enhancement | Image encoder | $N_{\mathrm{p}}/10^6$ | S4 $M_{\mathrm{J}}$/% | S4 $M_{\mathrm{F}}$/% | MS3 $M_{\mathrm{J}}$/% | MS3 $M_{\mathrm{F}}$/%
Without | ResNet-50 | 120.75 | 77.59 | 87.5 | 53.37 | 64.0
Without | PVT v2 | 177.98 | 84.24 | 91.4 | 59.96 | 71.6
With | ResNet-50 | 136.45 | 78.68 | 88.0 | 59.50 | 69.8
With | PVT v2 | 179.72 | 84.43 | 91.6 | 65.16 | 75.6
Tab.7 Impact of feature enhancement module on model performance
Fig.4 Comparison of features with or without feature enhancement
$L_{\text{loc}}$ | $L_{\text{cts}}$ | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/%
× | × | 78.42 | 87.9 | 57.62 | 68.1
× | ✓ | 78.34 | 87.9 | 59.11 | 69.8
✓ | ✓ | 78.68 | 88.0 | 59.50 | 69.8
Tab.8 Impact of contrastive loss and localization loss on model performance
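Tab.8 ablates the contrastive alignment loss $L_{\text{cts}}$ and the auxiliary localization loss $L_{\text{loc}}$. As a rough illustration of what such terms typically look like, the sketch below pairs an InfoNCE-style audio-visual contrastive loss with a heatmap-supervision term that rewards attention on image regions matching the audio; the exact formulations and weighting used by SSL2AVS are those defined in the paper body and may differ from this hypothetical version.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(vis_emb: torch.Tensor, aud_emb: torch.Tensor,
                     tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style audio-visual contrastive loss (illustrative): matched
    audio/visual pairs in a batch are positives, all other pairs negatives."""
    v = F.normalize(vis_emb, dim=1)                       # (B, D)
    a = F.normalize(aud_emb, dim=1)                       # (B, D)
    logits = v @ a.t() / tau                              # (B, B) similarities
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +       # visual -> audio
                  F.cross_entropy(logits.t(), labels))    # audio -> visual

def localization_loss(heatmap: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """Auxiliary localization loss (illustrative): pushes the sound-source
    heatmap (B, 1, h, w) towards the downsampled ground-truth mask
    (B, 1, H, W), so the model attends to regions that emit the sound."""
    gt = F.interpolate(gt_mask.float(), size=heatmap.shape[-2:], mode='nearest')
    return F.binary_cross_entropy(heatmap.clamp(0.0, 1.0), gt)
```

Tab.12 suggests that such auxiliary terms work best with small weights relative to the segmentation loss; the best MS3 result there uses coefficients of 0.01 and 0.10.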
Fusion strategy | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/%
No fusion | 77.29 | 86.9 | 55.98 | 66.9
Visual fusion only | 77.66 | 87.3 | 58.84 | 69.0
Audio fusion only | 77.84 | 87.2 | 57.57 | 68.6
Bidirectional fusion | 78.68 | 88.0 | 59.50 | 69.8
Tab.9 Impact of bidirectional attention fusion module on model performance
Number of queries | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/%
100 | 77.87 | 87.2 | 58.51 | 68.6
200 | 77.84 | 87.4 | 58.88 | 69.0
300 | 78.68 | 88.0 | 59.50 | 69.8
500 | 77.75 | 87.3 | 58.89 | 69.0
Tab.10 Impact of query quantity on model performance
Learnable embeddings | S4 $M_{\text{J}}$/% | S4 $M_{\text{F}}$/% | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/%
Without | 77.43 | 87.3 | 57.47 | 68.0
With | 78.68 | 88.0 | 59.50 | 69.8
Tab.11 Impact of learnable queries on model performance
No. | $\lambda_1$ | $\lambda_2$ | $\lambda_3$ | MS3 $M_{\text{J}}$/% | MS3 $M_{\text{F}}$/%
0 | 0.00 | 0.00 | 0.00 | 56.67 | 67.4
1 | 0.00 | 0.00 | 0.05 | 56.86 | 68.1
2 | 0.00 | 0.00 | 0.10 | 57.62 | 68.1
3 | 0.00 | 0.00 | 0.50 | 56.88 | 67.1
4 | 0.00 | 0.05 | 0.10 | 59.07 | 69.5
5 | 0.00 | 0.10 | 0.10 | 59.11 | 69.8
6 | 0.00 | 0.50 | 0.10 | 57.72 | 68.2
7 | 0.01 | 0.10 | 0.10 | 59.50 | 69.8
8 | 0.05 | 0.10 | 0.10 | 55.43 | 66.6
9 | 0.10 | 0.10 | 0.10 | 57.55 | 68.1
Tab.12 Impact of loss coefficients on model performance
Fig.5 Qualitative comparison of model performance between SSL2AVS and AVSegFormer on S4 and MS3 subsets
[1]   ARANDJELOVIĆ R, ZISSERMAN A. Look, listen and learn [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 609–617.
[2]   ARANDJELOVIĆ R, ZISSERMAN A. Objects that sound [C]// Proceedings of the European Conference on Computer Vision. Munich: ECVA, 2018: 451–466.
[3]   QIAN R, HU D, DINKEL H, et al. Multiple sound sources localization from coarse to fine [C]// Proceedings of the European Conference on Computer Vision. Glasgow: ECVA, 2020: 292–308.
[4]   MO S, MORGADO P. Localizing visual sounds the easy way [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: ECVA, 2022: 218–234.
[5]   HU X, CHEN Z, OWENS A. Mix and localize: localizing sound sources in mixtures [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10473–10482.
[6]   HU D, WEI Y, QIAN R, et al. Class-aware sounding objects localization via audiovisual correspondence [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(12): 9844–9859.
[7]   MO S, TIAN Y. Audio-visual grouping network for sound localization from mixtures [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 10565–10574.
[8]   MINAEE S, BOYKOV Y, PORIKLI F, et al. Image segmentation using deep learning: a survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(7): 3523–3542.
[9]   ZHOU J, WANG J, ZHANG J, et al. Audio–visual segmentation [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: ECVA, 2022: 386–403.
[10]   YANG Q, NIE X, LI T, et al. Cooperation does matter: exploring multi-order bilateral relations for audio-visual segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 27124–27133.
[11]   LIU J, WANG Y, JU C, et al. Annotation-free audio-visual segmentation [C]// Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2024: 5592–5602.
[12]   DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2020-10-22) [2025-01-10]. https://arxiv.org/abs/2010.11929.
[13]   WANG R, TANG D, DUAN N, et al. K-adapter: infusing knowledge into pre-trained models with adapters [EB/OL]. (2020-02-05) [2025-01-10]. https://arxiv.org/abs/2002.01808.
[14]   KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything [C]// Proceedings of the IEEE International Conference on Computer Vision. Paris: IEEE, 2023: 3992–4003.
[15]   WANG Y, LIU W, LI G, et al. Prompting segmentation with sound is generalizable audio-visual source localizer [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 5669–5677.
[16]   MA J, SUN P, WANG Y, et al. Stepping stones: a progressive training strategy for audio-visual semantic segmentation [C]// Proceedings of the European Conference on Computer Vision. Milan: ECVA, 2024: 311–327.
[17]   CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask Transformer for universal image segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1280–1289.
[18]   LIU J, JU C, MA C, et al. Audio-aware query-enhanced Transformer for audio-visual segmentation [EB/OL]. (2023-07-25) [2025-01-10]. https://arxiv.org/abs/2307.13236.
[19]   LI K, YANG Z, CHEN L, et al. CATR: combinatorial-dependence audio-queried Transformer for audio-visual video segmentation [C]// Proceedings of the 31st ACM International Conference on Multimedia. Ottawa: ACM, 2023: 1485–1494.
[20]   GAO S, CHEN Z, CHEN G, et al. AVSegFormer: audio-visual segmentation with Transformer [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 12155–12163.
[21]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st Annual Conference on Neural Information Processing Systems. Long Beach: NeurIPS Foundation, 2017: 6000–6010.
[22]   XU B, LIANG H, LIANG R, et al. Locate globally, segment locally: a progressive architecture with knowledge review network for salient object detection [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2021: 3004–3012.
[23]   MAO Y, ZHANG J, XIANG M, et al. Multimodal variational auto-encoder based audio-visual segmentation [C]// Proceedings of the IEEE International Conference on Computer Vision. Paris: IEEE, 2023: 954–965.
[24]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[25]   WANG W, XIE E, LI X, et al. PVT v2: improved baselines with pyramid vision Transformer [J]. Computational Visual Media, 2022, 8(3): 415–424. DOI: 10.1007/s41095-022-0274-8.
[26]   HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification [C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 131–135.
[27]   RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation [C]// Proceedings of the Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234–241.
[28]   WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the European Conference on Computer Vision. Munich: ECVA, 2018: 3–19.
[29]   ZHAO X, LIANG H, LI P, et al. Motion-aware memory network for fast video salient object detection [J]. IEEE Transactions on Image Processing, 2024, 33: 709–721.
[30]   WANG Q, WU B, ZHU P, et al. ECA-net: efficient channel attention for deep convolutional neural networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531–11539.
[31]   MILLETARI F, NAVAB N, AHMADI S A. V-net: fully convolutional neural networks for volumetric medical image segmentation [C]// Fourth International Conference on 3D Vision. Stanford: IEEE, 2016: 565–571.
[32]   LIU J, LIU Y, ZHANG F, et al. Audio-visual segmentation via unlabeled frame exploitation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 26318–26329.
[33]   LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization [EB/OL]. (2017-11-14) [2025-01-10]. https://arxiv.org/abs/1711.05101.
[34]   DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248–255.
[35]   GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: an ontology and human-labeled dataset for audio events [C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 776–780.