Journal of Zhejiang University (Engineering Science)  2025, Vol. 59 Issue (8): 1662-1670    DOI: 10.3785/j.issn.1008-973X.2025.08.013
Computer Technology, Control Engineering, and Communication Technology
Two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Panrong WANG 1, Hairong JIA 2,*, Shufei DUAN 2
1. College of Integrated Circuits, Taiyuan University of Technology, Taiyuan 030024, China
2. College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China
Full text: PDF (2495 KB) | HTML
Abstract:

A two-stage audio-visual speech enhancement algorithm based on convolution and gated attention was proposed to address the high complexity and limited performance of existing audio-visual speech enhancement models. A block-mixed gated attention unit (GAU) was adopted as the backbone of the feature fusion network, using a simplified single-head attention mechanism with quadratic intra-block and linear inter-block attention to reduce complexity while capturing the global dependencies of audio-visual sequences. A convolution module was integrated into the GAU, in which depthwise and pointwise convolutions extracted local features within and between audio-visual blocks, capturing the local dependencies of the sequences. The combination of convolution and attention significantly improved the model's performance on audio-visual sequences. A two-stage scheme was adopted to exploit the speech information contained in both modalities: in the first stage, audio served as the dominant modality and video as the conditional modality; in the second stage, video served as the dominant modality and the audio extracted in the first stage as the conditional modality. Experimental results show that the proposed model significantly outperforms existing models on both the PESQ and SNR metrics while effectively reducing complexity.
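To make the structure concrete, the following is a minimal PyTorch sketch of one convolution-augmented GAU block, assuming the hyperparameters of Table 1 (feature dimension M = 512, attention dimension D = 128) and a SiLU activation; it follows the general GAU recipe [12] plus a depthwise-pointwise convolution pair, and the paper's actual CNN-GAU module may be composed differently.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGAU(nn.Module):
    # Gated attention unit augmented with depthwise and pointwise convolution.
    def __init__(self, d_model=512, qk_dim=128, kernel=16):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.to_u = nn.Linear(d_model, d_model)  # gate branch U
        self.to_v = nn.Linear(d_model, d_model)  # value branch V
        self.to_z = nn.Linear(d_model, qk_dim)   # shared low-dim query/key base Z
        self.dw = nn.Conv1d(d_model, d_model, kernel, padding='same', groups=d_model)
        self.pw = nn.Conv1d(d_model, d_model, 1)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: [batch, time, d_model]
        h = self.norm(x)
        u = F.silu(self.to_u(h))                 # gate
        v = F.silu(self.to_v(h))                 # values
        z = self.to_z(h)
        # simplified single-head attention with the ReLU^2 kernel of GAU
        attn = F.relu(z @ z.transpose(-1, -2) / z.shape[-1]) ** 2
        v = attn @ v
        # local dependencies: depthwise then pointwise convolution over time
        v = self.pw(self.dw(v.transpose(1, 2))).transpose(1, 2)
        return x + self.out(u * v)               # gating and residual connection

x = torch.randn(2, 160, 512)                     # [batch, time, M]
y = ConvGAU()(x)                                 # -> [2, 160, 512]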

Key words: speech enhancement; convolution; attention; gated attention unit (GAU); multimodality
Received: 2024-07-15    Published: 2025-07-28
CLC number: TN 912
Funding: National Natural Science Foundation of China (12004275); Natural Science Foundation of Shanxi Province (20210302123186); Shanxi Province Research Project for Returned Overseas Scholars (2024-060).
Corresponding author: Hairong JIA    E-mail: wangpanrong2021@163.com; helenjia722@163.com
About the author: Panrong WANG (2000—), female, master's student, engaged in research on speech signal processing. orcid.org/0009-0004-1665-8760. E-mail: wangpanrong2021@163.com

Cite this article:

Panrong WANG, Hairong JIA, Shufei DUAN. Two-stage audio-visual speech enhancement algorithm based on convolution and gated attention. Journal of Zhejiang University (Engineering Science), 2025, 59(8): 1662-1670.

Link this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2025.08.013        https://www.zjujournals.com/eng/CN/Y2025/V59/I8/1662

Fig. 1  Framework of the two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Fig. 2  Structure of the convolution module
Fig. 3  Structure of the two-stage CNN-GAU module
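As a reading aid for Fig. 1 and Fig. 3, the two-stage data flow described in the abstract can be sketched as below; the function names (audio_enc, video_enc, fuse_stage1, fuse_stage2, decoder) are illustrative placeholders, not the paper's module names.

def two_stage_enhance(noisy_audio, lip_video,
                      audio_enc, video_enc, fuse_stage1, fuse_stage2, decoder):
    a = audio_enc(noisy_audio)    # audio features: dominant modality in stage 1
    v = video_enc(lip_video)      # visual features: conditional modality in stage 1
    a1 = fuse_stage1(a, v)        # stage-1 audio estimate
    m = fuse_stage2(v, a1)        # stage 2: video dominant, stage-1 audio as condition
    return decoder(m)             # enhanced speech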
Parameter                                                        Value
Conv1D kernel size L                                             16
Block length P                                                   80
Number of two-stage CNN-GAU modules R                            8
Feature dimension M of $\boldsymbol{X}'_{\text{U}}$, $\boldsymbol{X}'_{\text{I}}$, $\boldsymbol{V}_{\text{U}}$    512
Feature dimension D of $\boldsymbol{X}'_{\text{Z}}$, $\boldsymbol{V}_{\text{Z}}$    128
Table 1  Experimental parameter settings of the two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
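Table 1's block length P = 80 implies that the encoded sequence is segmented into blocks before attention, so that quadratic attention runs within each block and linear attention across blocks, in the spirit of dual-path processing [13]. A minimal segmentation sketch, assuming non-overlapping blocks with zero padding (the paper's exact scheme may differ):

import torch
import torch.nn.functional as F

def segment(x: torch.Tensor, P: int = 80) -> torch.Tensor:
    # x: [batch, M, T] feature sequence -> [batch, M, num_blocks, P]
    B, M, T = x.shape
    pad = (-T) % P                       # zero-pad so T is divisible by P
    x = F.pad(x, (0, pad))
    return x.reshape(B, M, -1, P)

x = torch.randn(2, 512, 1000)            # M = 512 per Table 1
blocks = segment(x)                       # -> [2, 512, 13, 80]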
Configuration    Value                       Configuration    Value
CPU              AMD Ryzen 7 3800X           GPU memory       10 GB
Clock speed      3.89 GHz                    IDE              PyCharm
RAM              48 GB                       Language         Python 3.8
GPU              NVIDIA GeForce RTX 3080
Table 2  Experimental environment of the two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Noisy            1.54/1.67/1.86/2.15/2.47         −10/−5/0/5/10
AV-ConvTasnet    2.13/2.17/2.32/2.36/2.40         3.98/4.26/4.44/4.42/4.31
MuSE             2.19/2.25/2.37/2.42/2.46         4.57/5.76/6.35/6.41/6.29
AV-Sepformer     2.53/2.59/2.75/2.82/2.88         6.47/6.58/6.83/6.89/6.78
Proposed model   2.77/2.79/3.01/3.16/3.25         12.29/13.09/14.11/14.54/14.55
Table 3  Comparison of speech enhancement performance of each model under wind-and-rain noise at five SNR levels
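For reference, the SNR values in Tables 3-8 can be computed from the clean and enhanced waveforms as sketched below (a standard SNR definition is assumed here; PESQ follows ITU-T P.862 and is usually obtained from an existing implementation rather than re-derived).

import numpy as np

def snr_db(clean: np.ndarray, enhanced: np.ndarray) -> float:
    # SNR in dB: clean-signal power over residual-error power
    noise = clean - enhanced
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))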
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Noisy            1.76/2.09/2.28/2.37/2.55         −10/−5/0/5/10
AV-ConvTasnet    2.27/2.33/2.42/2.47/2.52         3.97/3.96/4.20/4.04/4.06
MuSE             2.32/2.42/2.54/2.62/2.67         5.92/6.13/6.47/6.24/6.17
AV-Sepformer     2.69/2.74/2.84/2.90/2.93         6.75/6.58/6.89/6.77/6.81
Proposed model   2.96/3.10/3.28/3.39/3.47         13.93/14.07/14.80/14.58/14.68
Table 4  Comparison of speech enhancement performance of each model under ambulance noise at five SNR levels
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Noisy            1.70/1.92/2.25/2.42/2.64         −10/−5/0/5/10
AV-ConvTasnet    2.35/2.32/2.55/2.60/2.65         4.39/4.50/4.50/4.29/4.32
MuSE             2.35/2.55/2.65/2.69/2.73         6.27/6.80/6.91/6.47/6.38
AV-Sepformer     2.67/2.81/2.86/2.92/2.93         6.77/6.79/6.82/6.62/6.76
Proposed model   2.88/3.08/3.25/3.35/3.45         12.89/13.57/14.33/14.12/14.39
Table 5  Comparison of speech enhancement performance of each model under alarm clock noise at five SNR levels
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Experiment 1     2.53/2.59/2.75/2.82/2.88         6.47/6.58/6.83/6.89/6.78
Experiment 2     2.56/2.63/2.78/2.93/2.98         10.52/11.73/12.33/13.33/13.80
Experiment 3     2.70/2.75/2.97/3.13/3.22         11.84/12.95/14.03/14.64/14.68
Experiment 4     2.77/2.79/3.01/3.16/3.25         12.29/13.09/14.11/14.54/14.55
Table 6  Effect of different modules on PESQ and SNR of enhanced speech under wind-and-rain noise
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Experiment 1     2.69/2.74/2.84/2.90/2.93         6.75/6.58/6.89/6.77/6.81
Experiment 2     2.88/2.98/3.18/3.27/3.39         13.77/13.68/14.77/14.66/14.81
Experiment 3     2.94/3.06/3.26/3.38/3.47         14.06/14.23/14.97/14.76/14.89
Experiment 4     2.96/3.10/3.28/3.39/3.47         13.93/14.07/14.80/14.58/14.68
Table 7  Effect of different modules on PESQ and SNR of enhanced speech under ambulance noise
Model            PESQ (−10/−5/0/5/10 dB)          SNR/dB (−10/−5/0/5/10 dB)
Experiment 1     2.67/2.81/2.86/2.92/2.93         6.77/6.79/6.82/6.62/6.76
Experiment 2     2.81/2.99/3.15/3.24/3.29         12.88/13.39/13.99/13.99/14.27
Experiment 3     2.85/3.05/3.22/3.31/3.42         12.83/13.67/14.48/14.27/14.54
Experiment 4     2.88/3.08/3.25/3.35/3.45         12.89/13.57/14.33/14.12/14.39
Table 8  Effect of different modules on PESQ and SNR of enhanced speech under alarm clock noise
Model            Parameters/10^6    Computation/10^9
AV-ConvTasnet    11.03              22.08
MuSE             15.01              25.88
AV-Sepformer     29.63              141.83
Experiment 2     5.22               36.71
Experiment 3     10.52              68.27
Experiment 4     13.16              78.63
Table 9  Complexity comparison of each model
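The parameter column of Table 9 is in units of 10^6; in PyTorch this count can be read off a model directly, as in the sketch below, while the 10^9-scale computation column would normally come from a MACs profiler and is not reproduced here.

import torch.nn as nn

def param_count_millions(model: nn.Module) -> float:
    # total parameter count of the model, in units of 10^6
    return sum(p.numel() for p in model.parameters()) / 1e6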
Fig. 4  Comparison of spectrograms of enhanced speech from different models under −5 dB wind-and-rain noise
1 ZHANG Rui, ZHANG Pengyun, SUN Chaoli. Speech enhancement method based on multidomain fusion and neural architecture search [J]. Journal on Communications, 2024, 45(2): 225-239. doi: 10.11959/j.issn.1000-436x.2024018
2 JI Pengwei, QUAN Haiyan. GAN speech enhancement algorithm based on twin synthesizer and frequency domain discriminator [J]. Journal of Yunnan University: Natural Sciences Edition, 2024, 46(5): 871-880.
3 AFOURAS T, CHUNG J S, ZISSERMAN A. The conversation: deep audio-visual speech enhancement [C]// Interspeech. Hyderabad: Curran Associates, 2018: 3244-3248.
4 AFOURAS T, CHUNG J S, ZISSERMAN A. My lips are concealed: audio-visual speech enhancement through obstructions [C]// Interspeech. Graz: Curran Associates, 2019: 4295-4299.
5 MICHELSANTI D, TAN Z H, SIGURDSSON S, et al. Deep-learning-based audio-visual speech enhancement in presence of Lombard effect [J]. Speech Communication, 2019, 115: 38-50. doi: 10.1016/j.specom.2019.10.006
6 GOGATE M, DASHTIPOUR K, ADEEL A, et al. CochleaNet: a robust language-independent audio-visual model for real-time speech enhancement [J]. Information Fusion, 2020, 63(1): 273-285.
7 HOU J C, WANG S S, LAI Y H, et al. Audio-visual speech enhancement using multimodal deep convolutional neural networks [J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2018, 2(2): 117-128. doi: 10.1109/TETCI.2017.2784878
8 GABBAY A, SHAMIR A, PELEG S. Visual speech enhancement [C]// Interspeech. Hyderabad: Curran Associates, 2018: 1170-1174.
9 WU J, XU Y, ZHANG S X, et al. Time domain audio visual speech separation [C]// IEEE Automatic Speech Recognition and Understanding Workshop. Singapore: IEEE, 2019: 667-673.
10 PAN Z, TAO R, XU C, et al. MuSE: multi-modal target speaker extraction with visual cues [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Toronto: IEEE, 2021: 6678-6682.
11 LIN J, CAI X, DINKEL H, et al. AV-SepFormer: cross-attention SepFormer for audio-visual target speaker extraction [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
12 HUA W, DAI Z, LIU H, et al. Transformer quality in linear time [C]// International Conference on Machine Learning. Baltimore: [s. n.], 2022: 9099-9117.
13 LUO Y, CHEN Z, YOSHIOKA T. Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona: IEEE, 2020: 46-50.
14 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770-778.
15 ZHAO S, MA B. MossFormer: pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. Rhodes Island: IEEE, 2023: 1-5.
16 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30(1): 261-272.
17 SHAZEER N. GLU variants improve transformer [EB/OL]. (2020-02-12)[2024-07-15]. https://arxiv.org/pdf/2002.05202.
18 SU J, AHMED M, LU Y, et al. RoFormer: enhanced transformer with rotary position embedding [J]. Neurocomputing, 2024, 568: 127063. doi: 10.1016/j.neucom.2023.127063
19 PENG Y, DALMIA S, LANE I, et al. Branchformer: parallel MLP-attention architectures to capture local and global context for speech recognition and understanding [C]// International Conference on Machine Learning. Baltimore: [s. n.], 2022: 17627-17643.
20 MU Z, YANG X. Separate in the speech chain: cross-modal conditional audio-visual target speech extraction [EB/OL]. (2024-05-05)[2024-07-15]. https://arxiv.org/pdf/2404.12725.