Journal of ZheJiang University (Engineering Science)  2025, Vol. 59 Issue (8): 1662-1670    DOI: 10.3785/j.issn.1008-973X.2025.08.013
    
Two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Panrong WANG1, Hairong JIA2,*, Shufei DUAN2
1. College of Integrated Circuits, Taiyuan University of Technology, Taiyuan 030024, China
2. College of Electronic Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China

Abstract  

A two-stage audio-visual speech enhancement algorithm based on convolution and gated attention was proposed to address the high complexity and limited performance of existing audio-visual speech enhancement models. A block-mixed gated attention unit (GAU) was adopted as the backbone of the feature fusion network; by using a simplified single-head attention mechanism and applying quadratic attention within blocks and linear attention between blocks, the complexity was reduced and the global dependencies of audio-visual sequences were captured. A convolutional module was integrated into the GAU, in which depthwise and pointwise convolutions extracted local features within and between audio-visual blocks to capture the local dependencies of the sequences. The combination of convolution and attention significantly improved the performance of the model on audio-visual sequences. A two-stage algorithm was adopted to exploit the speech information contained in both modalities. In the first stage, audio served as the dominant modality and video as the conditional modality; in the second stage, video served as the dominant modality and the audio extracted in the first stage served as the conditional modality. The experimental results show that the proposed model significantly improves both PESQ and SNR compared with existing models while effectively reducing complexity.
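To make the fusion backbone concrete, a minimal sketch of one convolution-augmented GAU block operating on segmented (block-mixed) features is given below. It assumes PyTorch; the class name ConvGAUBlock, the twofold expansion of the gating and value branches, the shared query/key projection, and the averaging of inter-block key-value statistics are illustrative assumptions rather than the authors' exact design.

```python
# Minimal sketch (not the authors' released code) of a convolution-augmented
# gated attention unit acting on segmented audio-visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGAUBlock(nn.Module):
    """One block: intra-block quadratic + inter-block linear attention, gated,
    followed by depthwise/pointwise convolutions for local dependencies."""

    def __init__(self, dim: int = 128, attn_dim: int = 64, kernel_size: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Gating (U) and value (V) branches, as in GAU.
        self.to_u = nn.Linear(dim, dim * 2)
        self.to_v = nn.Linear(dim, dim * 2)
        # Single shared projection for the simplified single-head attention.
        self.to_qk = nn.Linear(dim, attn_dim)
        # Depthwise + pointwise convolutions for local features.
        self.depthwise = nn.Conv1d(dim * 2, dim * 2, kernel_size,
                                   padding='same', groups=dim * 2)
        self.pointwise = nn.Conv1d(dim * 2, dim * 2, 1)
        self.out = nn.Linear(dim * 2, dim)

    def forward(self, x):
        # x: (batch, num_blocks K, block_length P, dim)
        b, k, p, _ = x.shape
        residual = x
        x = self.norm(x)
        u = F.silu(self.to_u(x))                      # gate
        v = F.silu(self.to_v(x))                      # values
        qk = self.to_qk(x)                            # shared query/key

        # Intra-block quadratic attention: full softmax attention inside each block.
        scores = torch.einsum('bkpd,bkqd->bkpq', qk, qk) / qk.shape[-1] ** 0.5
        intra = torch.einsum('bkpq,bkqe->bkpe', scores.softmax(dim=-1), v)

        # Inter-block linear attention: summarize each block into a key-value
        # statistic, average the statistics over blocks, and read them out at
        # linear cost in the sequence length.
        kv = torch.einsum('bkpd,bkpe->bkde', qk, v) / p    # (b, K, attn_dim, 2*dim)
        kv = kv.mean(dim=1)                                # mix information across blocks
        inter = torch.einsum('bkpd,bde->bkpe', qk, kv)

        h = u * (intra + inter)                       # gated global context

        # Convolution module: depthwise + pointwise conv along the flattened
        # time axis to capture local intra-/inter-block dependencies.
        h = h.reshape(b, k * p, -1).transpose(1, 2)
        h = self.pointwise(self.depthwise(h))
        h = h.transpose(1, 2).reshape(b, k, p, -1)

        return residual + self.out(h)


# Example: 2 utterances segmented into 8 blocks of length 80, feature dim 128.
x = torch.randn(2, 8, 80, 128)
y = ConvGAUBlock(dim=128)(x)
print(y.shape)  # torch.Size([2, 8, 80, 128])
```

According to Tab.1, R = 8 such two-stage CNN-GAU modules are stacked in the full model; following the two-stage scheme described above, audio acts as the dominant modality and video as the conditional modality in the first stage, and the roles are swapped in the second stage.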



Key words: speech enhancement; convolution; attention; gated attention unit (GAU); multimodality
Received: 15 July 2024      Published: 28 July 2025
CLC:  TN 912  
Fund: National Natural Science Foundation of China (12004275); Natural Science Foundation of Shanxi Province (20210302123186); Scientific Research Foundation for Returned Overseas Scholars of Shanxi Province (2024-060).
Corresponding Authors: Hairong JIA     E-mail: wangpanrong2021@163.com;helenjia722@163.com
Cite this article:

Panrong WANG,Hairong JIA,Shufei DUAN. Two-stage audio-visual speech enhancement algorithm based on convolution and gated attention. Journal of ZheJiang University (Engineering Science), 2025, 59(8): 1662-1670.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2025.08.013     OR     https://www.zjujournals.com/eng/Y2025/V59/I8/1662


Fig.1 Framework diagram of two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Fig.2 Structure diagram of convolutional module
Fig.3 Structure diagram of two-stage CNN-GAU module
Parameter                                     Value
Conv1D kernel size L                          16
Block length P                                80
Number of two-stage CNN-GAU modules R         8
Feature dimension M of ${{\boldsymbol{X}}'_{\text{U}}}$, ${{\boldsymbol{X}}'_{\text{I}}}$, ${{\boldsymbol{V}}_{\text{U}}}$    512
Feature dimension D of ${{\boldsymbol{X}}'_{\text{Z}}}$, ${{\boldsymbol{V}}_{\text{Z}}}$    128
Tab.1 Experimental parameter setting for two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
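For reference, the settings in Tab.1 can be collected into a single configuration object. The sketch below is illustrative only: the dataclass and its field names are assumptions, while the numeric values are taken from the table.

```python
# Hyperparameters of Tab.1 gathered in one place. The dataclass and its field
# names are illustrative; only the numeric values come from the table.
from dataclasses import dataclass


@dataclass
class TwoStageCNNGAUConfig:
    conv1d_kernel_size: int = 16    # Conv1D kernel size L
    block_length: int = 80          # block length P
    num_modules: int = 8            # number of two-stage CNN-GAU modules R
    feature_dim_m: int = 512        # feature dimension M of X'_U, X'_I, V_U
    feature_dim_d: int = 128        # feature dimension D of X'_Z, V_Z


print(TwoStageCNNGAUConfig())
```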
Environment item        Configuration
CPU                     AMD Ryzen 7 3800X
CPU clock frequency     3.89 GHz
RAM                     48 GB
GPU                     NVIDIA GeForce RTX 3080
GPU memory              10 GB
IDE                     PyCharm
Programming language    Python 3.8
Tab.2 Experimental environment for two-stage audio-visual speech enhancement algorithm based on convolution and gated attention
Model             PESQ (−10/−5/0/5/10 dB input SNR)      SNR/dB (−10/−5/0/5/10 dB input SNR)
Noisy             1.54/1.67/1.86/2.15/2.47               −10/−5/0/5/10
AV-ConvTasnet     2.13/2.17/2.32/2.36/2.40               3.98/4.26/4.44/4.42/4.31
MuSE              2.19/2.25/2.37/2.42/2.46               4.57/5.76/6.35/6.41/6.29
AV-Sepformer      2.53/2.59/2.75/2.82/2.88               6.47/6.58/6.83/6.89/6.78
Proposed model    2.77/2.79/3.01/3.16/3.25               12.29/13.09/14.11/14.54/14.55
Tab.3 Comparison of speech enhancement performance of various models at five signal-to-noise ratios under wind and rain noise
Model             PESQ (−10/−5/0/5/10 dB input SNR)      SNR/dB (−10/−5/0/5/10 dB input SNR)
Noisy             1.76/2.09/2.28/2.37/2.55               −10/−5/0/5/10
AV-ConvTasnet     2.27/2.33/2.42/2.47/2.52               3.97/3.96/4.20/4.04/4.06
MuSE              2.32/2.42/2.54/2.62/2.67               5.92/6.13/6.47/6.24/6.17
AV-Sepformer      2.69/2.74/2.84/2.90/2.93               6.75/6.58/6.89/6.77/6.81
Proposed model    2.96/3.10/3.28/3.39/3.47               13.93/14.07/14.80/14.58/14.68
Tab.4 Comparison of speech enhancement performance of various models at five signal-to-noise ratios under ambulance noise
Model             PESQ (−10/−5/0/5/10 dB input SNR)      SNR/dB (−10/−5/0/5/10 dB input SNR)
Noisy             1.70/1.92/2.25/2.42/2.64               −10/−5/0/5/10
AV-ConvTasnet     2.35/2.32/2.55/2.60/2.65               4.39/4.50/4.50/4.29/4.32
MuSE              2.35/2.55/2.65/2.69/2.73               6.27/6.80/6.91/6.47/6.38
AV-Sepformer      2.67/2.81/2.86/2.92/2.93               6.77/6.79/6.82/6.62/6.76
Proposed model    2.88/3.08/3.25/3.35/3.45               12.89/13.57/14.33/14.12/14.39
Tab.5 Comparison of speech enhancement performance of various models at five signal-to-noise ratios under alarm noise
Model            PESQ (−10/−5/0/5/10 dB input SNR)      SNR/dB (−10/−5/0/5/10 dB input SNR)
Experiment 1     2.53/2.59/2.75/2.82/2.88               6.47/6.58/6.83/6.89/6.78
Experiment 2     2.56/2.63/2.78/2.93/2.98               10.52/11.73/12.33/13.33/13.80
Experiment 3     2.70/2.75/2.97/3.13/3.22               11.84/12.95/14.03/14.64/14.68
Experiment 4     2.77/2.79/3.01/3.16/3.25               12.29/13.09/14.11/14.54/14.55
Tab.6 Influence of different modules on PESQ and SNR of enhanced speech under wind and rain noise
Model            PESQ (−10/−5/0/5/10 dB input SNR)      SNR/dB (−10/−5/0/5/10 dB input SNR)
Experiment 1     2.69/2.74/2.84/2.90/2.93               6.75/6.58/6.89/6.77/6.81
Experiment 2     2.88/2.98/3.18/3.27/3.39               13.77/13.68/14.77/14.66/14.81
Experiment 3     2.94/3.06/3.26/3.38/3.47               14.06/14.23/14.97/14.76/14.89
Experiment 4     2.96/3.10/3.28/3.39/3.47               13.93/14.07/14.80/14.58/14.68
Tab.7 Influence of different modules on PESQ and SNR of enhanced speech under ambulance noise
Model            PESQ (−10/−5/0/5/10 dB input SNR)      SNR/dB (−10/−5/0/5/10 dB input SNR)
Experiment 1     2.67/2.81/2.86/2.92/2.93               6.77/6.79/6.82/6.62/6.76
Experiment 2     2.81/2.99/3.15/3.24/3.29               12.88/13.39/13.99/13.99/14.27
Experiment 3     2.85/3.05/3.22/3.31/3.42               12.83/13.67/14.48/14.27/14.54
Experiment 4     2.88/3.08/3.25/3.35/3.45               12.89/13.57/14.33/14.12/14.39
Tab.8 Influence of different modules on PESQ and SNR of enhanced speech under alarm noise
Model            Parameters/10^6    Computational cost/10^9
AV-ConvTasnet    11.03              22.08
MuSE             15.01              25.88
AV-Sepformer     29.63              141.83
Experiment 2     5.22               36.71
Experiment 3     10.52              68.27
Experiment 4     13.16              78.63
Tab.9 Comparison of complexity of various models
Fig.4 Comparison of enhanced speech spectrogram of different models on −5 dB wind and rain noise