Journal of Zhejiang University (Engineering Science)  2025, Vol. 59 Issue (8): 1662-1670    DOI: 10.3785/j.issn.1008-973X.2025.08.013
Computer Technology, Control Engineering, and Communication Technology
Two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Panrong WANG 1, Hairong JIA 2,*, Shufei DUAN 2
1. College of Integrated Circuits, Taiyuan University of Technology, Taiyuan 030024, China
2. College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China
Full text: PDF (2495 KB) | HTML
Abstract:

A two-stage audio-visual speech enhancement algorithm based on convolution and gated attention was proposed to address the high complexity and limited performance of existing audio-visual speech enhancement models. A block-mixed gated attention unit (GAU) was adopted as the backbone of the feature fusion network, using a simplified single-head attention mechanism with quadratic intra-block and linear inter-block attention to reduce complexity while capturing the global dependencies of audio-visual sequences. A convolution module was integrated into the GAU, in which depthwise and pointwise convolutions extracted local features within and between audio-visual blocks, capturing the local dependencies of the sequences. The combination of convolution and attention significantly improved the model's performance on audio-visual sequences. A two-stage scheme was adopted to exploit the speech information contained in both modalities: in the first stage, audio served as the dominant modality and video as the conditional modality; in the second stage, video served as the dominant modality and the audio extracted in the first stage as the conditional modality. Experimental results show that the proposed model significantly outperforms existing models on both the PESQ and SNR metrics while effectively reducing complexity.
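To make the structure concrete, the following is a minimal PyTorch sketch of one convolution-augmented GAU block, assuming the hyperparameters of Table 1 (feature dimension M = 512, attention dimension D = 128) and a SiLU activation; it follows the general GAU recipe [12] plus a depthwise-pointwise convolution pair, and the paper's actual CNN-GAU module may be composed differently.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGAU(nn.Module):
    # Gated attention unit augmented with depthwise and pointwise convolution.
    def __init__(self, d_model=512, qk_dim=128, kernel=16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.to_u = nn.Linear(d_model, d_model)  # gate branch U
        self.to_v = nn.Linear(d_model, d_model)  # value branch V
        self.to_z = nn.Linear(d_model, qk_dim)   # shared low-dim query/key base Z
        self.dw = nn.Conv1d(d_model, d_model, kernel, padding='same', groups=d_model)
        self.pw = nn.Conv1d(d_model, d_model, 1)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: [batch, time, d_model]
        h = self.norm(x)
        u = F.silu(self.to_u(h))                 # gate
        v = F.silu(self.to_v(h))                 # values
        z = self.to_z(h)
        # simplified single-head attention with the ReLU^2 kernel of GAU
        attn = F.relu(z @ z.transpose(-1, -2) / z.shape[-1]) ** 2
        v = attn @ v
        # local dependencies: depthwise then pointwise convolution over time
        v = self.pw(self.dw(v.transpose(1, 2))).transpose(1, 2)
        return x + self.out(u * v)               # gating and residual connection

x = torch.randn(2, 160, 512)                     # [batch, time, M]
y = ConvGAU()(x)                                 # -> [2, 160, 512]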

Key words: speech enhancement; convolution; attention; gated attention unit (GAU); multimodality
Received: 2024-07-15    Published: 2025-07-28
CLC number: TN 912
Funding: National Natural Science Foundation of China (12004275); Natural Science Foundation of Shanxi Province (20210302123186); Shanxi Province Research Project for Returned Overseas Scholars (2024-060).
Corresponding author: Hairong JIA    E-mail: wangpanrong2021@163.com; helenjia722@163.com
About the author: Panrong WANG (2000—), female, master's student, engaged in research on speech signal processing. orcid.org/0009-0004-1665-8760. E-mail: wangpanrong2021@163.com

Cite this article:

Panrong WANG, Hairong JIA, Shufei DUAN. Two-stage audio-visual speech enhancement algorithm based on convolution and gated attention. Journal of Zhejiang University (Engineering Science), 2025, 59(8): 1662-1670.

Link this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.08.013        https://www.zjujournals.com/eng/CN/Y2025/V59/I8/1662

Fig. 1  Framework of the two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Fig. 2  Structure of the convolution module
Fig. 3  Structure of the two-stage CNN-GAU module
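As a reading aid for Fig. 1 and Fig. 3, the two-stage data flow described in the abstract can be sketched as below; the function names (audio_enc, video_enc, fuse_stage1, fuse_stage2, decoder) are illustrative placeholders, not the paper's module names.

def two_stage_enhance(noisy_audio, lip_video,
                      audio_enc, video_enc, fuse_stage1, fuse_stage2, decoder):
    a = audio_enc(noisy_audio)    # audio features: dominant modality in stage 1
    v = video_enc(lip_video)      # visual features: conditional modality in stage 1
    a1 = fuse_stage1(a, v)        # stage-1 audio estimate
    m = fuse_stage2(v, a1)        # stage 2: video dominant, stage-1 audio as condition
    return decoder(m)             # enhanced speech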
Parameter                                                        Value
Conv1D kernel size L                                             16
Block length P                                                   80
Number of two-stage CNN-GAU modules R                            8
Feature dimension M of $\boldsymbol{X}'_{\text{U}}$, $\boldsymbol{X}'_{\text{I}}$, $\boldsymbol{V}_{\text{U}}$    512
Feature dimension D of $\boldsymbol{X}'_{\text{Z}}$, $\boldsymbol{V}_{\text{Z}}$    128
Table 1  Experimental parameter settings of the two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
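Table 1's block length P = 80 implies that the encoded sequence is segmented into blocks before attention, so that quadratic attention runs within each block and linear attention across blocks, in the spirit of dual-path processing [13]. A minimal segmentation sketch, assuming non-overlapping blocks with zero padding (the paper's exact scheme may differ):

import torch
import torch.nn.functional as F

def segment(x: torch.Tensor, P: int = 80) -> torch.Tensor:
    # x: [batch, M, T] feature sequence -> [batch, M, num_blocks, P]
    B, M, T = x.shape
    pad = (-T) % P                       # zero-pad so T is divisible by P
    x = F.pad(x, (0, pad))
    return x.reshape(B, M, -1, P)

x = torch.randn(2, 512, 1000)            # M = 512 per Table 1
blocks = segment(x)                       # -> [2, 512, 13, 80]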
Configuration    Value                       Configuration    Value
CPU              AMD Ryzen 7 3800X           GPU memory       10 GB
Clock speed      3.89 GHz                    IDE              PyCharm
RAM              48 GB                       Language         Python 3.8
GPU              NVIDIA GeForce RTX 3080
Table 2  Experimental environment of the two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Noisy            1.54/1.67/1.86/2.15/2.47         −10/−5/0/5/10
AV-ConvTasnet    2.13/2.17/2.32/2.36/2.40         3.98/4.26/4.44/4.42/4.31
MuSE             2.19/2.25/2.37/2.42/2.46         4.57/5.76/6.35/6.41/6.29
AV-Sepformer     2.53/2.59/2.75/2.82/2.88         6.47/6.58/6.83/6.89/6.78
Proposed model   2.77/2.79/3.01/3.16/3.25         12.29/13.09/14.11/14.54/14.55
Table 3  Comparison of speech enhancement performance of each model under wind-and-rain noise at five SNR levels
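For reference, the SNR values in Tables 3-8 can be computed from the clean and enhanced waveforms as sketched below (a standard SNR definition is assumed here; PESQ follows ITU-T P.862 and is usually obtained from an existing implementation rather than re-derived).

import numpy as np

def snr_db(clean: np.ndarray, enhanced: np.ndarray) -> float:
    # SNR in dB: clean-signal power over residual-error power
    noise = clean - enhanced
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))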
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Noisy            1.76/2.09/2.28/2.37/2.55         −10/−5/0/5/10
AV-ConvTasnet    2.27/2.33/2.42/2.47/2.52         3.97/3.96/4.20/4.04/4.06
MuSE             2.32/2.42/2.54/2.62/2.67         5.92/6.13/6.47/6.24/6.17
AV-Sepformer     2.69/2.74/2.84/2.90/2.93         6.75/6.58/6.89/6.77/6.81
Proposed model   2.96/3.10/3.28/3.39/3.47         13.93/14.07/14.80/14.58/14.68
Table 4  Comparison of speech enhancement performance of each model under ambulance noise at five SNR levels
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Noisy            1.70/1.92/2.25/2.42/2.64         −10/−5/0/5/10
AV-ConvTasnet    2.35/2.32/2.55/2.60/2.65         4.39/4.50/4.50/4.29/4.32
MuSE             2.35/2.55/2.65/2.69/2.73         6.27/6.80/6.91/6.47/6.38
AV-Sepformer     2.67/2.81/2.86/2.92/2.93         6.77/6.79/6.82/6.62/6.76
Proposed model   2.88/3.08/3.25/3.35/3.45         12.89/13.57/14.33/14.12/14.39
Table 5  Comparison of speech enhancement performance of each model under alarm clock noise at five SNR levels
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Experiment 1     2.53/2.59/2.75/2.82/2.88         6.47/6.58/6.83/6.89/6.78
Experiment 2     2.56/2.63/2.78/2.93/2.98         10.52/11.73/12.33/13.33/13.80
Experiment 3     2.70/2.75/2.97/3.13/3.22         11.84/12.95/14.03/14.64/14.68
Experiment 4     2.77/2.79/3.01/3.16/3.25         12.29/13.09/14.11/14.54/14.55
Table 6  Effect of different modules on PESQ and SNR of enhanced speech under wind-and-rain noise
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Experiment 1     2.69/2.74/2.84/2.90/2.93         6.75/6.58/6.89/6.77/6.81
Experiment 2     2.88/2.98/3.18/3.27/3.39         13.77/13.68/14.77/14.66/14.81
Experiment 3     2.94/3.06/3.26/3.38/3.47         14.06/14.23/14.97/14.76/14.89
Experiment 4     2.96/3.10/3.28/3.39/3.47         13.93/14.07/14.80/14.58/14.68
Table 7  Effect of different modules on PESQ and SNR of enhanced speech under ambulance noise
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Experiment 1     2.67/2.81/2.86/2.92/2.93         6.77/6.79/6.82/6.62/6.76
Experiment 2     2.81/2.99/3.15/3.24/3.29         12.88/13.39/13.99/13.99/14.27
Experiment 3     2.85/3.05/3.22/3.31/3.42         12.83/13.67/14.48/14.27/14.54
Experiment 4     2.88/3.08/3.25/3.35/3.45         12.89/13.57/14.33/14.12/14.39
Table 8  Effect of different modules on PESQ and SNR of enhanced speech under alarm clock noise
Model            Parameters/10^6    Computation/10^9
AV-ConvTasnet    11.03              22.08
MuSE             15.01              25.88
AV-Sepformer     29.63              141.83
Experiment 2     5.22               36.71
Experiment 3     10.52              68.27
Experiment 4     13.16              78.63
Table 9  Complexity comparison of each model
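The parameter column of Table 9 is in units of 10^6; in PyTorch this count can be read off a model directly, as in the sketch below, while the 10^9-scale computation column would normally come from a MACs profiler and is not reproduced here.

import torch.nn as nn

def param_count_millions(model: nn.Module) -> float:
    # total parameter count of the model, in units of 10^6
    return sum(p.numel() for p in model.parameters()) / 1e6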
Fig. 4  Comparison of spectrograms of enhanced speech from different models under −5 dB wind-and-rain noise
1 ZHANG Rui, ZHANG Pengyun, SUN Chaoli. Speech enhancement method based on multidomain fusion and neural architecture search [J]. Journal on Communications, 2024, 45(2): 225-239. doi: 10.11959/j.issn.1000-436x.2024018
2 JI Pengwei, QUAN Haiyan. GAN speech enhancement algorithm based on twin synthesizer and frequency domain discriminator [J]. Journal of Yunnan University: Natural Sciences Edition, 2024, 46(5): 871-880.
3 AFOURAS T, CHUNG J S, ZISSERMAN A. The conversation: deep audio-visual speech enhancement [C]// Interspeech. Hyderabad: Curran Associates, 2018: 3244-3248.
4 AFOURAS T, CHUNG J S, ZISSERMAN A. My lips are concealed: audio-visual speech enhancement through obstructions [C]// Interspeech. Graz: Curran Associates, 2019: 4295-4299.
5 MICHELSANTI D, TAN Z H, SIGURDSSON S, et al. Deep-learning-based audio-visual speech enhancement in presence of Lombard effect [J]. Speech Communication, 2019, 115: 38-50. doi: 10.1016/j.specom.2019.10.006
6 GOGATE M, DASHTIPOUR K, ADEEL A, et al. CochleaNet: a robust language-independent audio-visual model for real-time speech enhancement [J]. Information Fusion, 2020, 63(1): 273-285.
7 HOU J C, WANG S S, LAI Y H, et al. Audio-visual speech enhancement using multimodal deep convolutional neural networks [J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2018, 2(2): 117-128. doi: 10.1109/TETCI.2017.2784878
8 GABBAY A, SHAMIR A, PELEG S. Visual speech enhancement [C]// Interspeech. Hyderabad: Curran Associates, 2018: 1170-1174.
9 WU J, XU Y, ZHANG S X, et al. Time domain audio visual speech separation [C]// IEEE Automatic Speech Recognition and Understanding Workshop. Singapore: IEEE, 2019: 667-673.
10 PAN Z, TAO R, XU C, et al. MuSE: multi-modal target speaker extraction with visual cues [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6678-6682.
11 LIN J, CAI X, DINKEL H, et al. AV-SepFormer: cross-attention SepFormer for audio-visual target speaker extraction [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
12 HUA W, DAI Z, LIU H, et al. Transformer quality in linear time [C]// International Conference on Machine Learning. Baltimore: [s. n.], 2022: 9099-9117.
13 LUO Y, CHEN Z, YOSHIOKA T. Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 46-50.
14 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
15 ZHAO S, MA B. MossFormer: pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
16 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30(1): 261-272.
17 SHAZEER N. GLU variants improve transformer [EB/OL]. (2020-02-12)[2024-07-15]. https://arxiv.org/pdf/2002.05202.
18 SU J, AHMED M, LU Y, et al. RoFormer: enhanced transformer with rotary position embedding [J]. Neurocomputing, 2024, 568: 127063. doi: 10.1016/j.neucom.2023.127063
19 PENG Y, DALMIA S, LANE I, et al. Branchformer: parallel MLP-attention architectures to capture local and global context for speech recognition and understanding [C]// International Conference on Machine Learning. Baltimore: [s. n.], 2022: 17627-17643.
20 MU Z, YANG X. Separate in the speech chain: cross-modal conditional audio-visual target speech extraction [EB/OL]. (2024-05-05)[2024-07-15]. https://arxiv.org/pdf/2404.12725.