Journal of Zhejiang University (Engineering Science)  2026, Vol. 60, Issue 4: 896-905    DOI: 10.3785/j.issn.1008-973X.2026.04.021
Electronics and Information Engineering
Speech-evoked EEG decoding based on Multi-scale Attention Temporal Encoding Network
Zihao YAO, Hairong JIA*, Yarong LI, Guijun CHEN
College of Electronic and Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China
Abstract:

A Multi-scale Attention Temporal Encoding Network (MATE-Net) was proposed to address the complexity of speech-evoked EEG features and the difficulty of acquiring covert speech (whispered and imagined) data. The relatively abundant overt speech data were leveraged to train the model, which was then applied to covert speech decoding tasks. An Inception-based multi-receptive-field module was used to extract multi-scale features from the input signals, while a bidirectional GRU captured contextual dependencies and improved the representation of temporal dynamics. To ease the training of the deep network, residual connections were added to keep gradients stable during backpropagation. A multi-head attention mechanism was further introduced to capture both local and global temporal dependencies and to strengthen the representation of salient features in the sequence. Experimental results showed that the model performed well in overt speech decoding: in five-fold cross-validation it reached an average test-set accuracy of 74.30%, with Spearman and Pearson correlation coefficients of 0.884 and 0.942, respectively. The pre-trained MATE-Net was successfully transferred to whispered and imagined speech tasks, enabling effective reconstruction of speech spectrograms.

Key words: brain-computer interface; electroencephalography (EEG); overt speech; whispered speech; imagined speech
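The abstract names four building blocks (Inception-style multi-scale convolutions, a bidirectional GRU, residual connections, and multi-head attention). As a minimal illustration of the first idea only, the sketch below passes a single-channel signal through several 1-D convolutions with different receptive fields and stacks the branch outputs; the averaging kernels and the 256 Hz toy sine are assumptions for illustration, not the authors' learned weights or data.

```python
import numpy as np

def multiscale_block(x, kernel_sizes=(3, 5, 7)):
    """Inception-style multi-receptive-field block for a 1-D signal.

    Each branch convolves the input with a kernel of a different width
    (here: simple averaging kernels as placeholders for learned weights),
    and the branch outputs are stacked along a feature axis so that
    features at several temporal scales coexist in the output.
    """
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                       # placeholder weights
        branches.append(np.convolve(x, kernel, mode="same"))
    return np.stack(branches, axis=0)                 # (n_scales, n_samples)

# Toy single-channel "EEG" segment: 1 s of a 10 Hz rhythm at 256 Hz.
eeg = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 256))
features = multiscale_block(eeg)
print(features.shape)  # (3, 256)
```

In the real model each branch would use learned convolution weights and multiple channels; the point here is only the parallel-branches-then-concatenate structure.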
Received: 2025-04-21    Published: 2026-03-19
CLC:  TN 911.7  
Funding: National Natural Science Foundation of China (62201377); Shanxi Basic Research Program (202403021211098); Shanxi Graduate Innovation Project (RC2400005582).
Corresponding author: Hairong JIA     E-mail: 1078349047@qq.com; helenjia722@163.com
About the author: Zihao YAO (2000—), male, master's student, working on speech signal processing. orcid.org/0009-0008-2621-0766. E-mail: 1078349047@qq.com

Cite this article:


Zihao YAO, Hairong JIA, Yarong LI, Guijun CHEN. Speech-evoked EEG decoding based on Multi-scale Attention Temporal Encoding Network [J]. Journal of Zhejiang University (Engineering Science), 2026, 60(4): 896-905.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.04.021        https://www.zjujournals.com/eng/CN/Y2026/V60/I4/896

Fig. 1  Overall pipeline of speech-evoked EEG decoding
Fig. 2  Architecture of the Multi-scale Attention Temporal Encoding Network (MATE-Net)
Item       Configuration
OS         Windows 11
CPU        Intel Core i9-14900KF
Clock      3.20 GHz
RAM        128 GB
GPU        NVIDIA RTX 4090D
VRAM       24 GB
IDE        PyCharm
Language   Python 3.10
Table 1  Hardware and software configuration of the experimental environment
Fold     Acc_train/%   Acc_test/%   MSE     S_cos   ρ       r
Fold 1   76.38         74.33        0.319   0.975   0.890   0.945
Fold 2   76.32         74.00        0.344   0.975   0.880   0.937
Fold 3   76.31         74.14        0.334   0.975   0.885   0.941
Fold 4   76.35         74.43        0.314   0.975   0.884   0.943
Fold 5   76.31         74.60        0.314   0.975   0.882   0.943
Mean     76.33         74.30        0.325   0.975   0.884   0.942
Table 2  Results of five-fold cross-validation
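Table 2 reports both Spearman's ρ and Pearson's r between the reconstructed and reference spectra: ρ measures monotonic agreement, r measures linear agreement. A self-contained sketch of both metrics on toy data (not the paper's spectrograms; the double-argsort ranking shortcut is an assumption that only holds for tie-free values):

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: linear agreement between two 1-D arrays."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the ranks
    (this simple double-argsort ranking assumes no tied values)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson_r(rank(a), rank(b))

# Toy check: a perfectly monotonic but nonlinear relation.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3
print(round(spearman_rho(x, y), 3))  # 1.0   (monotonic -> perfect rank agreement)
print(round(pearson_r(x, y), 3))     # 0.943 (nonlinearity lowers linear agreement)
```

The gap between the two numbers is why the paper reports both: a reconstruction can track the target's ups and downs (high ρ) while still deviating from it in amplitude (lower r).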
Model                                  Acc/%   ρ       r
Model 1 (Inception only)               53.46   0.609   0.634
Model 2 (Inception + residual)         63.90   0.743   0.795
Model 3 (Inception + residual + GRU)   68.59   0.829   0.893
Model 4 (MATE-Net)                     74.30   0.884   0.942
Table 3  Ablation results for the key modules of MATE-Net
Model                  Acc/%   ρ       r
LDA                    56.61   0.645   0.661
Logistic Regression    55.39   0.640   0.658
Decision Tree          61.25   0.726   0.765
ShallowConvNet [27]    55.15   0.493   0.459
DeepConvNet [27]       57.49   0.648   0.703
EEGNet [27]            58.31   0.541   0.570
EEG-ITNet [28]         54.66   0.532   0.542
EEG-Transformer [29]   59.46   0.683   0.713
EEG-TCNet [30]         61.80   0.712   0.738
EEGformer [31]         64.39   0.813   0.872
MATE-Net               74.30   0.884   0.942
Table 4  Performance comparison of MATE-Net with mainstream decoding methods
Fig. 3  Comparison of original and reconstructed spectrograms
Fig. 4  Word-level comparison of original and reconstructed mel spectrograms
Fig. 5  Comparison of original and reconstructed speech waveforms
Model           Acc (P1/P2/P3)/%        ρ (P1/P2/P3)            r (P1/P2/P3)
LDA             32.27 / 33.60 / 33.49   0.478 / 0.457 / 0.487   0.472 / 0.439 / 0.448
Decision Tree   45.55 / 46.92 / 45.01   0.581 / 0.593 / 0.571   0.582 / 0.602 / 0.570
EEGformer       54.34 / 53.50 / 54.79   0.781 / 0.783 / 0.766   0.782 / 0.789 / 0.770
MATE-Net        68.72 / 67.80 / 66.99   0.937 / 0.905 / 0.922   0.938 / 0.907 / 0.924
Table 5  Cross-dataset comparison results
Model       Mean r (whispered/imagined)   Median r (whispered/imagined)
EEGformer   0.313 / 0.366                 0.352 / 0.391
MATE-Net    0.376 / 0.449                 0.437 / 0.467
Table 6  Correlation between overt and covert speech
Fig. 6  Spectrogram and waveform comparison between overt and covert speech
1 HILARI K, NEEDLE J J, HARRISON K L What are the important factors in health-related quality of life for people with aphasia? a systematic review[J]. Archives of Physical Medicine and Rehabilitation, 2012, 93 (1): S86- S95
doi: 10.1016/j.apmr.2011.05.028
2 JELLINGER K A The spectrum of cognitive dysfunction in amyotrophic lateral sclerosis: an update[J]. International Journal of Molecular Sciences, 2023, 24 (19): 14647
doi: 10.3390/ijms241914647
3 LIU Jinzhen, YE Fangfang, XIONG Hui Recognition of multi-class motor imagery EEG signals based on convolutional neural network[J]. Journal of Zhejiang University: Engineering Science, 2021, 55 (11): 2054- 2066 (in Chinese)
4 TANG J Effect of brain-computer interface training on functional recovery after stroke[J]. Theoretical and Natural Science, 2023, 21 (1): 75- 79
doi: 10.54254/2753-8818/21/20230821
5 LUO S, ANGRICK M, COOGAN C, et al Stable decoding from a speech BCI enables control for an individual with ALS without recalibration for 3 months[J]. Advanced Science, 2023, 10 (35): e2304853
doi: 10.1002/advs.202304853
6 ANUMANCHIPALLI G K, CHARTIER J, CHANG E F Speech synthesis from neural decoding of spoken sentences[J]. Nature, 2019, 568 (7753): 493- 498
doi: 10.1038/s41586-019-1119-1
7 LI M, LIAO S, PUN S H, et al. Effects of EEG analysis window location on classifying spoken mandarin monosyllables [C]// 11th International IEEE/EMBS Conference on Neural Engineering. Baltimore: IEEE, 2023: 1–4.
8 MELINDA M, JUWONO F H, ENRIKO I K A, et al Application of continuous wavelet transform and support vector machine for autism spectrum disorder electroencephalography signal classification[J]. Radioelectronic and Computer Systems, 2023, (3): 73- 90
doi: 10.32620/reks.2023.3.07
9 LIU H, ONG Y S, YU Z, et al Scalable Gaussian process classification with additive noise for non-Gaussian likelihoods[J]. IEEE Transactions on Cybernetics, 2022, 52 (7): 5842- 5854
doi: 10.1109/TCYB.2020.3043355
10 ALENAZI F S, EL HINDI K, ASSADHAN B Complement-class harmonized Naïve Bayes classifier[J]. Applied Sciences, 2023, 13 (8): 4852
doi: 10.3390/app13084852
11 ABDULGHANI M M, WALTERS W L, ABED K H Imagined speech classification using EEG and deep learning[J]. Bioengineering, 2023, 10 (6): 649
doi: 10.3390/bioengineering10060649
12 QI H, GAO N. Research on the classification algorithm of imaginary speech EEG signals based on twin neural network [C]// 7th International Conference on Signal and Image Processing. Suzhou: IEEE, 2022: 211–216.
13 GASPARINI F, CAZZANIGA E, SAIBENE A. Inner speech recognition through electroencephalographic signals [EB/OL]. (2022–10–11) [2025–04–21]. https://arxiv.org/abs/2210.06472.
14 PARK H J, LEE B Multiclass classification of imagined speech EEG using noise-assisted multivariate empirical mode decomposition and multireceptive field convolutional neural network[J]. Frontiers in Human Neuroscience, 2023, 17: 1186594
doi: 10.3389/fnhum.2023.1186594
15 VORONTSOVA D, MENSHIKOV I, ZUBOV A, et al Silent EEG-speech recognition using convolutional and recurrent neural network with 85% accuracy of 9 words classification[J]. Sensors, 2021, 21 (20): 6744
doi: 10.3390/s21206744
16 CHEN X, WANG R, KHALILIAN-GOURTANI A, et al A neural speech decoding framework leveraging deep learning and speech synthesis[J]. Nature Machine Intelligence, 2024, 6 (4): 467- 480
doi: 10.1038/s42256-024-00824-8
17 MARTIN S, BRUNNER P, ITURRATE I, et al Word pair classification during imagined speech using direct brain recordings[J]. Scientific Reports, 2016, 6: 25803
doi: 10.1038/srep25803
18 KOMEIJI S, MITSUHASHI T, IIMURA Y, et al Feasibility of decoding covert speech in ECoG with a Transformer trained on overt speech[J]. Scientific Reports, 2024, 14: 11491
doi: 10.1038/s41598-024-62230-9
19 CANOLTY R T, EDWARDS E, DALAL S S, et al High gamma power is phase-locked to theta oscillations in human neocortex[J]. Science, 2006, 313 (5793): 1626- 1628
doi: 10.1126/science.1128115
20 ABDI H, WILLIAMS L J Principal component analysis[J]. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2 (4): 433- 459
doi: 10.1002/wics.101
21 SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions [C]// IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1–9.
22 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
23 DEY R, SALEM F M. Gate-variants of gated recurrent unit (GRU) neural networks [C]// IEEE 60th International Midwest Symposium on Circuits and Systems. Boston: IEEE, 2017: 1597–1600.
24 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems 30. Long Beach: Curran Associates, 2017: 5998–6008.
25 SRIVASTAVA N, HINTON G E, KRIZHEVSKY A, et al Dropout: a simple way to prevent neural networks from overfitting[J]. The Journal of Machine Learning Research, 2014, 15 (1): 1929- 1958
26 ANGRICK M, OTTENHOFF M C, DIENER L, et al Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity[J]. Communications Biology, 2021, 4: 1055
doi: 10.1038/s42003-021-02578-0
27 LAWHERN V J, SOLON A J, WAYTOWICH N R, et al EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces[J]. Journal of Neural Engineering, 2018, 15 (5): 056013
doi: 10.1088/1741-2552/aace8c
28 SALAMI A, ANDREU-PEREZ J, GILLMEISTER H EEG-ITNet: an explainable inception temporal convolutional network for motor imagery classification[J]. IEEE Access, 2022, 10: 36672- 36685
doi: 10.1109/ACCESS.2022.3161489
29 LEE Y E, LEE S H. EEG-transformer: self-attention from transformer architecture for decoding EEG of imagined speech [C]// 10th International Winter Conference on Brain-Computer Interface. Gangwon-do: IEEE, 2022: 1–4.
30 INGOLFSSON T M, HERSCHE M, WANG X, et al. EEG-TCNet: an accurate temporal convolutional network for embedded motor-imagery brain-machine interfaces [C]// IEEE International Conference on Systems, Man, and Cybernetics. Toronto: IEEE, 2020: 2958–2965.
31 WAN Z, LI M, LIU S, et al EEGformer: a transformer-based brain activity classification method using EEG signal[J]. Frontiers in Neuroscience, 2023, 17: 1148855
doi: 10.3389/fnins.2023.1148855