Contrastive learning-based sound source localization-guided audio-visual segmentation model

doi:10.3785/j.issn.1008-973X.2025.09.004

Journal of ZheJiang University (Engineering Science)

2025, Vol. 59

Issue (9): 1803-1813 DOI: 10.3785/j.issn.1008-973X.2025.09.004

Contrastive learning-based sound source localization-guided audio-visual segmentation model

Wenhu HUANG(

),Xing ZHAO*(

),Liang XIE,Haoran LIANG,Ronghua LIANG

College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310023, China

Download:

HTML

PDF(2271KB) HTML
Export: BibTeX | EndNote (RIS)

Abstract

A sound source localization-guided audio-visual segmentation (SSL2AVS) model based on contrastive learning was proposed to address the problem that background noise hindered effective information exchange and object discrimination in audio-visual segmentation (AVS) tasks. A two-stage localization-to-segmentation progressive strategy was adopted, where visual features were refined through sound source localization to suppress background interference, making the model suitable for audio-visual segmentation in complex scenes. Prior to segmentation, a target localization module was introduced to align audio-visual modalities via the contrastive learning method, generating sound source heatmaps to achieve preliminary sound source localization. A multi-scale feature pyramid network incorporating a feature enhancement module was constructed to dynamically weight and fuse the shallow spatial detail features and the deep semantic features based on the localization results, effectively amplifying the visual features of target objects while suppressing background noise. The synergistic operation of the two modules improved visual representations of objects and enabled the model to focus on object identification. An auxiliary localization loss function was proposed to optimize localization results by encouraging the model to focus on the image regions that matched audio features. Experimental results on the MS3 dataset demonstrated that the model achieved a mean Intersection over Union (mIoU) of 62.15, surpassing the baseline AVSegFormer model.

Key words： audio-visual segmentation cross-modal interaction sound source localization contrastive learning feature enhancement

Received: 04 December 2024 Published: 25 August 2025

CLC:

TP 391.4

Fund: 国家自然科学基金资助项目(62402441, 62432014, 62176235); 浙江省自然科学基金资助项目(LDT23F0202, LDT23F02021F02).

Corresponding Authors: Xing ZHAO E-mail: 211123120094@zjut.edu.cn;xing@zjut.edu.cn

	Service
	E-mail this article
	Add to my bookshelf
	Add to citation manager
	E-mail Alert
	RSS
	Articles by authors
	Wenhu HUANG
	Xing ZHAO
	Liang XIE
	Haoran LIANG
	Ronghua LIANG

Cite this article:

Wenhu HUANG,Xing ZHAO,Liang XIE,Haoran LIANG,Ronghua LIANG. Contrastive learning-based sound source localization-guided audio-visual segmentation model. Journal of ZheJiang University (Engineering Science), 2025, 59(9): 1803-1813.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.09.004 OR https://www.zjujournals.com/eng/Y2025/V59/I9/1803

基于对比学习的声源定位引导视听分割模型

针对视听分割任务中背景噪声阻碍有效信息交互和物体辨别的问题，提出基于对比学习的声源定位引导视听分割模型（SSL2AVS）. 采用从定位到分割的两阶段策略，通过声源定位引导视觉特征优化，从而减少背景噪声干扰，使模型适用于复杂场景中的视听分割. 在分割前引入目标定位模块，利用对比学习方法对齐视听模态并生成声源热力图，实现发声物体粗定位；引入特征增强模块，构建多尺度特征金字塔网络，利用定位结果动态地加权融合浅层空间细节特征与深层语义特征，在引导增强目标物体视觉特征的同时抑制背景噪声. 2个模块协同作用，增强物体的视觉表示，使模型专注于物体辨识. 为了优化定位结果，提出辅助定位损失函数，促使模型关注与音频特征匹配的图像区域. 实验结果表明，模型在MS3数据集上的mIoU为62.15，高于基线AVSegFormer模型.

关键词： 视听分割, 跨模态交互, 声源定位, 对比学习, 特征增强

Fig.1 Structure of contrastive learning-based SSL-guided AVS model

Fig.2 Structure of feature enhancement module

Tab.1 Performance comparison between SSL2AVS and baseline model

Tab.2 Performance comparison of SSL2AVS and existing AVS methods

Tab.3 Effect of characteristic size change caused by removal of pool layer on experimental results

Fig.3 Comparison of AVS results with and without pool layer, pretraining and ACT activation

Tab.4 Performance comparison of different initialization strategies on MS3 sub-dataset

Tab.5 Effect of ACT activation on experimental results

Tab.6 Performance comparison of SSL2AVS and baseline model using a small amount of training data

Tab.7 Impact of feature enhancement module on model performance

Fig.4 Comparison of features with or without feature enhancement

Tab.8 Impact of contrastive loss and location loss on model performance

Tab.9 Impact of bidirectional attention fusion module on model performance

Tab.10 Impact of query quantity on model performance

Tab.11 Impact of learnable queries on model performance

Tab.12 Impact of loss coefficients on model performance

Fig.5 Qualitative comparison of model performance between SSL2AVS and AVSegFormer on S4 and MS3 subsets


[1]	ARANDJELOVIĆ R, ZISSERMAN A. Look, listen and learn [C]// Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 609–617.

[2]	ARANDJELOVIĆ R, ZISSERMAN A. Objects that sound [C]// Proceedings of the European Conference on Computer Vision. Murich: ECVA, 2018: 451–466.

[3]	QIAN R, HU D, DINKEL H, et al. Multiple sound sources localization from coarse to fine [C]// Proceedings of the European Conference on Computer Vision. Glasgow: ECVA, 2020: 292–308.

[4]	MO S, MORGADO P. Localizing visual sounds the easy way [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: ECVA, 2022: 218–234.

[5]	HU X, CHEN Z, OWENS A. Mix and localize: localizing sound sources in mixtures [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 10473–10482.

[6]	HU D, WEI Y, QIAN R, et al Class-aware sounding objects localization via audiovisual correspondence[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44 (12): 9844- 9859

[7]	MO S, TIAN Y. Audio-visual grouping network for sound localization from mixtures [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 10565–10574.

[8]	MINAEE S, BOYKOV Y, PORIKLI F, et al Image segmentation using deep learning: a survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44 (7): 3523- 3542

[9]	ZHOU J, WANG J, ZHANG J, et al. Audio–visual segmentation [C]// Proceedings of the European Conference on Computer Vision. Tel Aviv: ECVA, 2022: 386–403.

[10]	YANG Q, NIE X, LI T, et al. Cooperation does matter: exploring multi-order bilateral relations for audio-visual segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 27124–27133.

[11]	LIU J, WANG Y, JU C, et al. Annotation-free audio-visual segmentation [C]// Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Waikoloa: IEEE, 2024: 5592–5602.

[12]	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2020-10-22) [2025-01-10]. https://arxiv.org/abs/2010.11929.

[13]	WANG R, TANG D, DUAN N, et al. K-adapter: infusing knowledge into pre-trained models with adapters [EB/OL]. (2020-02-05) [2025-01-10]. https://arxiv.org/abs/2002.01808.

[14]	KIRILLOV A, MINTUN E, RAVI N, et al. Segment anything [C]// Proceedings of the IEEE International Conference on Computer Vision. Paris: IEEE, 2023: 3992–4003.

[15]	WANG Y, LIU W, LI G, et al. Prompting segmentation with sound is generalizable audio-visual source localizer [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 5669–5677.

[16]	MA J, SUN P, WANG Y, et al. Stepping stones: a progressive training strategy for audio-visual semantic segmentation [C]// Proceedings of the European Conference on Computer Vision. Milan: ECVA, 2024: 311–327.

[17]	CHENG B, MISRA I, SCHWING A G, et al. Masked-attention mask Transformer for universal image segmentation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 1280–1289.

[18]	LIU J, JU C, MA C, et al. Audio-aware query-enhanced Transformer for audio-visual segmentation [EB/OL]. (2023-07-25) [2025-01-10]. https://arxiv.org/abs/2307.13236.

[19]	LI K, YANG Z, CHEN L, et al. CATR: combinatorial-dependence audio-queried Transformer for audio-visual video segmentation [C]// Proceedings of the 31st ACM International Conference on Multimedia. Ottawa: ACM, 2023: 1485–1494.

[20]	GAO S, CHEN Z, CHEN G, et al. AVSegFormer: audio-visual segmentation with Transformer [C]// Proceedings of the AAAI Conference on Artificial Intelligence. Vancouver: AAAI, 2024: 12155–12163.

[21]	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st Annual Conference on Neural Information Processing Systems. Long Beach: NeurIPS Foundation, 2017: 6000–6010.

[22]	XU B, LIANG H, LIANG R, et al. Locate globally, segment locally: a progressive architecture with knowledge review network for salient object detection [C]// Proceedings of the AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2021: 3004–3012.

[23]	MAO Y, ZHANG J, XIANG M, et al. Multimodal variational auto-encoder based audio-visual segmentation [C]// Proceedings of the IEEE International Conference on Computer Vision. Paris: IEEE, 2023: 954–965.

[24]	HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.

[25]	WANG W, XIE E, LI X, et al PVT v2: improved baselines with pyramid vision Transformer[J]. Computational Visual Media, 2022, 8 (3): 415- 424 doi: 10.1007/s41095-022-0274-8

[26]	HERSHEY S, CHAUDHURI S, ELLIS D P W, et al. CNN architectures for large-scale audio classification [C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 131–135.

[27]	RONNEBERGER O, FISCHER P, BROX T. U-net: convolutional networks for biomedical image segmentation [C]// Proceedings of the Medical Image Computing and Computer-Assisted Intervention. Munich: Springer, 2015: 234–241.

[28]	WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the European Conference on Computer Vision. Munich: ECVA, 2018: 3–19.

[29]	ZHAO X, LIANG H, LI P, et al Motion-aware memory network for fast video salient object detection[J]. IEEE Transactions on Image Processing, 2024, 33: 709- 721

[30]	WANG Q, WU B, ZHU P, et al. ECA-net: efficient channel attention for deep convolutional neural networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020: 11531–11539.

[31]	MILLETARI F, NAVAB N, AHMADI S A. V-net: fully convolutional neural networks for volumetric medical image segmentation [C]// Fourth International Conference on 3D Vision. Stanford: IEEE, 2016: 565–571.

[32]	LIU J, LIU Y, ZHANG F, et al. Audio-visual segmentation via unlabeled frame exploitation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2024: 26318–26329.

[33]	LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization [EB/OL]. (2017-11-14) [2025-01-10]. https://arxiv.org/abs/1711.05101.

[34]	DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 248–255.

[35]	GEMMEKE J F, ELLIS D P W, FREEDMAN D, et al. Audio set: an ontology and human-labeled dataset for audio events [C]// Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. New Orleans: IEEE, 2017: 776–780.

[1]	Yue HOU,Qianhui LI,Peng YUAN,Xin ZHANG,Tiantian WANG,Ziwei HAO. Scalable traffic image auto-annotation method based on contrastive learning[J]. Journal of ZheJiang University (Engineering Science), 2025, 59(8): 1634-1643.

[2]	Ke CHEN,Wenhao ZHANG. Zero-shot object rumor detection based on contrastive learning[J]. Journal of ZheJiang University (Engineering Science), 2024, 58(9): 1790-1800.

[3]	Tian-qi ZHOU,Yan YANG,Ji-jie ZHANG,Shao-wei YIN,Zeng-qiang GUO. Graph contrastive learning based on negative-sample-free loss and adaptive augmentation[J]. Journal of ZheJiang University (Engineering Science), 2023, 57(2): 259-266.

[4]	Yu XIE,Zi-qun BAO,Na ZHANG,Biao WU,Xiao-mei TU,Xiao-an BAO. Object detection algorithm based on feature enhancement and deep fusion[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(12): 2403-2415.

[5]	Ming LI,Li-juan DUAN,Wen-jian WANG,Qing EN. Brain functional connections classification method based on significant sparse strong correlation[J]. Journal of ZheJiang University (Engineering Science), 2022, 56(11): 2232-2240.

[6]	Jia-cheng LIU,Jun-zhong JI. Classification method of fMRI data based on broad learning system[J]. Journal of ZheJiang University (Engineering Science), 2021, 55(7): 1270-1278.

[7]	Pu ZHENG,Hong-yang BAI,Wei LI,Hong-wei GUO. Small target detection algorithm in complex background[J]. Journal of ZheJiang University (Engineering Science), 2020, 54(9): 1777-1784.

Viewed

Full text

Abstract

Cited

Shared

Discussed