Journal of ZheJiang University (Engineering Science)  2026, Vol. 60 Issue (4): 896-905    DOI: 10.3785/j.issn.1008-973X.2026.04.021
Speech-evoked EEG decoding based on Multi-scale Attention Temporal Encoding Network
Zihao YAO, Hairong JIA*, Yarong LI, Guijun CHEN
College of Electronic and Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China

Abstract  

A Multi-scale Attention Temporal Encoding Network (MATE-Net) was proposed to address the complexity of EEG signal features and the difficulty of acquiring covert speech data (whispered and imagined speech). The relatively abundant overt speech data were leveraged to train the model, which was then applied to covert speech decoding tasks. An Inception-based multi-receptive-field module was used to extract multi-scale features from the input signals, and a bidirectional GRU captured contextual dependencies to improve the representation of temporal dynamics. To ease the training of deep networks, residual connections were added to ensure stable gradient flow during backpropagation. A multi-head attention mechanism was further introduced to capture both local and global temporal dependencies, strengthening the representation of salient features in the sequence. Experimental results showed that the model performed well on overt speech decoding: in five-fold cross-validation, the average test-set accuracy reached 74.30%, with Spearman and Pearson correlation coefficients of 0.884 and 0.942, respectively. The pre-trained MATE-Net was successfully transferred to whispered and imagined speech tasks, enabling effective reconstruction of speech spectrograms.
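The abstract names four building blocks: an Inception-style multi-receptive-field front end, residual connections, a bidirectional GRU, and multi-head attention. The PyTorch sketch below is an illustration of how such a stack can be wired, not the authors' implementation: the paper's actual kernel widths, channel counts, hidden sizes, and output dimensionality are not stated on this page, so every hyperparameter here (kernel sizes 3/5/7, 64 EEG channels, 80 mel bins) and every class name is an assumption.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel temporal convolutions with different receptive fields
    (kernel sizes 3/5/7 are illustrative, not the paper's values)."""
    def __init__(self, in_ch: int, branch_ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, branch_ch, k, padding=k // 2) for k in (3, 5, 7)
        )

    def forward(self, x):                    # x: (batch, channels, time)
        return torch.cat([b(x) for b in self.branches], dim=1)

class MATENetSketch(nn.Module):
    """Hypothetical MATE-Net-like stack: Inception front end + residual
    projection + bidirectional GRU + multi-head self-attention."""
    def __init__(self, n_channels=64, branch_ch=16, hidden=64,
                 n_heads=4, n_mel=80):
        super().__init__()
        feat = 3 * branch_ch                 # concatenated branch width
        self.inception = InceptionBlock(n_channels, branch_ch)
        self.proj = nn.Conv1d(n_channels, feat, 1)   # 1x1 conv for residual
        self.gru = nn.GRU(feat, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_mel)     # per-frame mel prediction

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.inception(x) + self.proj(x)         # residual connection
        h = h.transpose(1, 2)                        # -> (batch, time, feat)
        h, _ = self.gru(h)                           # contextual encoding
        a, _ = self.attn(h, h, h)                    # local/global dependencies
        h = h + a                                    # residual around attention
        return self.head(h)                          # (batch, time, n_mel)
```

For a 2-sample batch of 64-channel EEG with 100 time steps, `MATENetSketch()(x)` yields a `(2, 100, 80)` mel-spectrogram estimate per frame.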



Key words: brain-computer interface; electroencephalography (EEG); overt speech; whispered speech; imagined speech
Received: 21 April 2025      Published: 19 March 2026
CLC:  TN 911.7  
Fund: National Natural Science Foundation of China (62201377); Shanxi Basic Research Program (202403021211098); Shanxi Graduate Research Innovation Project (RC2400005582).
Corresponding Authors: Hairong JIA     E-mail: 1078349047@qq.com;helenjia722@163.com
Cite this article:

Zihao YAO,Hairong JIA,Yarong LI,Guijun CHEN. Speech-evoked EEG decoding based on Multi-scale Attention Temporal Encoding Network. Journal of ZheJiang University (Engineering Science), 2026, 60(4): 896-905.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2026.04.021     OR     https://www.zjujournals.com/eng/Y2026/V60/I4/896


Fig.1 Overall process of speech-evoked EEG decoding
Fig.2 Multi-scale Attention Temporal Encoding Network (MATE-Net) architecture
Item          Configuration
OS            Windows 11
CPU           Intel Core i9-14900KF
Clock speed   3.20 GHz
RAM           128 GB
GPU           NVIDIA RTX 4090D
VRAM          24 GB
IDE           PyCharm
Language      Python 3.10
Tab.1 Hardware and software configuration of experiment
Fold     Acc_train/%   Acc_test/%   MSE     S_cos   ρ       r
Fold 1   76.38         74.33        0.319   0.975   0.890   0.945
Fold 2   76.32         74.00        0.344   0.975   0.880   0.937
Fold 3   76.31         74.14        0.334   0.975   0.885   0.941
Fold 4   76.35         74.43        0.314   0.975   0.884   0.943
Fold 5   76.31         74.60        0.314   0.975   0.882   0.943
Mean     76.33         74.30        0.325   0.975   0.884   0.942
Tab.2 Experimental results of five-fold cross-validation
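The ρ and r columns in Tab.2 are Spearman and Pearson correlations between predicted and reference spectrogram values. For readers reproducing such scores, a minimal stdlib-only sketch follows; the function names are ours, and in practice `scipy.stats.pearsonr`/`spearmanr` are the standard tools.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences
    (assumes neither sequence is constant)."""
    mx, my = mean(x), mean(y)
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r of the rank-transformed data,
    with average ranks assigned to ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1                      # extend run of tied values
            avg = (i + j) / 2 + 1           # average rank for the tied run
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    return pearson_r(ranks(x), ranks(y))
```

Pearson measures linear agreement between predicted and reference values, while Spearman only requires monotone agreement, which is why a model can score higher on r than on ρ (as in Tab.2) when its errors are close to linear rescalings.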
Model structure                        Acc_train/%   ρ       r
Model 1 (Inception only)               53.46         0.609   0.634
Model 2 (Inception + residual)         63.90         0.743   0.795
Model 3 (Inception + residual + GRU)   68.59         0.829   0.893
Model 4 (MATE-Net)                     74.30         0.884   0.942
Tab.3 Ablation results on the impact of key MATE-Net modules on model performance
Model                 Acc_train/%   ρ       r
LDA                   56.61         0.645   0.661
Logistic Regression   55.39         0.640   0.658
Decision Tree         61.25         0.726   0.765
ShallowConv[27]       55.15         0.493   0.459
DeepConv[27]          57.49         0.648   0.703
EEGNet[27]            58.31         0.541   0.570
EEG-ITNet[28]         54.66         0.532   0.542
EEG-Transformer[29]   59.46         0.683   0.713
EEG-TCNet[30]         61.80         0.712   0.738
EEGformer[31]         64.39         0.813   0.872
MATE-Net              74.30         0.884   0.942
Tab.4 Performance comparison between MATE-Net and existing decoding methods
Fig.3 Comparison of original and reconstructed spectrograms
Fig.4 Comparison of original Mel spectrogram and reconstructed Mel spectrogram at word level
Fig.5 Comparison of original and reconstructed speech waveforms
Model           Acc_train (P1/P2/P3)/%    ρ (P1/P2/P3)              r (P1/P2/P3)
LDA             32.27 / 33.60 / 33.49     0.478 / 0.457 / 0.487     0.472 / 0.439 / 0.448
Decision Tree   45.55 / 46.92 / 45.01     0.581 / 0.593 / 0.571     0.582 / 0.602 / 0.570
EEGformer       54.34 / 53.50 / 54.79     0.781 / 0.783 / 0.766     0.782 / 0.789 / 0.770
MATE-Net        68.72 / 67.80 / 66.99     0.937 / 0.905 / 0.922     0.938 / 0.907 / 0.924
Tab.5 Cross-dataset comparative experimental results
Model       Mean r (whispered/imagined)   Median r (whispered/imagined)
EEGformer   0.313 / 0.366                 0.352 / 0.391
MATE-Net    0.376 / 0.449                 0.437 / 0.467
Tab.6 Correlation between overt speech and covert speech
Fig.6 Spectral and waveform comparison between overt and covert speech
[1]   HILARI K, NEEDLE J J, HARRISON K L What are the important factors in health-related quality of life for people with aphasia? a systematic review[J]. Archives of Physical Medicine and Rehabilitation, 2012, 93 (1): S86- S95
doi: 10.1016/j.apmr.2011.05.028
[2]   JELLINGER K A The spectrum of cognitive dysfunction in amyotrophic lateral sclerosis: an update[J]. International Journal of Molecular Sciences, 2023, 24 (19): 14647
doi: 10.3390/ijms241914647
[3]   LIU Jinzhen, YE Fangfang, XIONG Hui Recognition of multi-class motor imagery EEG signals based on convolutional neural network[J]. Journal of Zhejiang University: Engineering Science, 2021, 55 (11): 2054- 2066
[4]   TANG J Effect of brain-computer interface training on functional recovery after stroke[J]. Theoretical and Natural Science, 2023, 21 (1): 75- 79
doi: 10.54254/2753-8818/21/20230821
[5]   LUO S, ANGRICK M, COOGAN C, et al Stable decoding from a speech BCI enables control for an individual with ALS without recalibration for 3 months[J]. Advanced Science, 2023, 10 (35): e2304853
doi: 10.1002/advs.202304853
[6]   ANUMANCHIPALLI G K, CHARTIER J, CHANG E F Speech synthesis from neural decoding of spoken sentences[J]. Nature, 2019, 568 (7753): 493- 498
doi: 10.1038/s41586-019-1119-1
[7]   LI M, LIAO S, PUN S H, et al. Effects of EEG analysis window location on classifying spoken mandarin monosyllables [C]// 11th International IEEE/EMBS Conference on Neural Engineering. Baltimore: IEEE, 2023: 1–4.
[8]   MELINDA M, JUWONO F H, ENRIKO I K A, et al Application of continuous wavelet transform and support vector machine for autism spectrum disorder electroencephalography signal classification[J]. Radioelectronic and Computer Systems, 2023, (3): 73- 90
doi: 10.32620/reks.2023.3.07
[9]   LIU H, ONG Y S, YU Z, et al Scalable Gaussian process classification with additive noise for non-Gaussian likelihoods[J]. IEEE Transactions on Cybernetics, 2022, 52 (7): 5842- 5854
doi: 10.1109/TCYB.2020.3043355
[10]   ALENAZI F S, EL HINDI K, ASSADHAN B Complement-class harmonized Naïve Bayes classifier[J]. Applied Sciences, 2023, 13 (8): 4852
doi: 10.3390/app13084852
[11]   ABDULGHANI M M, WALTERS W L, ABED K H Imagined speech classification using EEG and deep learning[J]. Bioengineering, 2023, 10 (6): 649
doi: 10.3390/bioengineering10060649
[12]   QI H, GAO N. Research on the classification algorithm of imaginary speech EEG signals based on twin neural network [C]// 7th International Conference on Signal and Image Processing. Suzhou: IEEE, 2022: 211–216.
[13]   GASPARINI F, CAZZANIGA E, SAIBENE A. Inner speech recognition through electroencephalographic signals [EB/OL]. (2022–10–11) [2025–04–21]. https://arxiv.org/abs/2210.06472.
[14]   PARK H J, LEE B Multiclass classification of imagined speech EEG using noise-assisted multivariate empirical mode decomposition and multireceptive field convolutional neural network[J]. Frontiers in Human Neuroscience, 2023, 17: 1186594
doi: 10.3389/fnhum.2023.1186594
[15]   VORONTSOVA D, MENSHIKOV I, ZUBOV A, et al Silent EEG-speech recognition using convolutional and recurrent neural network with 85% accuracy of 9 words classification[J]. Sensors, 2021, 21 (20): 6744
doi: 10.3390/s21206744
[16]   CHEN X, WANG R, KHALILIAN-GOURTANI A, et al A neural speech decoding framework leveraging deep learning and speech synthesis[J]. Nature Machine Intelligence, 2024, 6 (4): 467- 480
doi: 10.1038/s42256-024-00824-8
[17]   MARTIN S, BRUNNER P, ITURRATE I, et al Word pair classification during imagined speech using direct brain recordings[J]. Scientific Reports, 2016, 6: 25803
doi: 10.1038/srep25803
[18]   KOMEIJI S, MITSUHASHI T, IIMURA Y, et al Feasibility of decoding covert speech in ECoG with a Transformer trained on overt speech[J]. Scientific Reports, 2024, 14: 11491
doi: 10.1038/s41598-024-62230-9
[19]   CANOLTY R T, EDWARDS E, DALAL S S, et al High gamma power is phase-locked to theta oscillations in human neocortex[J]. Science, 2006, 313 (5793): 1626- 1628
doi: 10.1126/science.1128115
[20]   ABDI H, WILLIAMS L J Principal component analysis[J]. Wiley Interdisciplinary Reviews: Computational Statistics, 2010, 2 (4): 433- 459
doi: 10.1002/wics.101
[21]   SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions [C]// IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1–9.
[22]   HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
[23]   DEY R, SALEM F M. Gate-variants of gated recurrent unit (GRU) neural networks [C]// IEEE 60th International Midwest Symposium on Circuits and Systems. Boston: IEEE, 2017: 1597–1600.
[24]   VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. Advances in Neural Information Processing Systems, 2017, 30.
[25]   SRIVASTAVA N, HINTON G E, KRIZHEVSKY A, et al Dropout: a simple way to prevent neural networks from overfitting[J]. The Journal of Machine Learning Research, 2014, 15: 1929- 1958
[26]   ANGRICK M, OTTENHOFF M C, DIENER L, et al Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity[J]. Communications Biology, 2021, 4: 1055
doi: 10.1038/s42003-021-02578-0
[27]   LAWHERN V J, SOLON A J, WAYTOWICH N R, et al EEGNet: a compact convolutional neural network for EEG-based brain–computer interfaces[J]. Journal of Neural Engineering, 2018, 15 (5): 056013
doi: 10.1088/1741-2552/aace8c
[28]   SALAMI A, ANDREU-PEREZ J, GILLMEISTER H EEG-ITNet: an explainable inception temporal convolutional network for motor imagery classification[J]. IEEE Access, 2022, 10: 36672- 36685
doi: 10.1109/ACCESS.2022.3161489
[29]   LEE Y E, LEE S H. EEG-transformer: self-attention from transformer architecture for decoding EEG of imagined speech [C]// 10th International Winter Conference on Brain-Computer Interface. Gangwon-do: IEEE, 2022: 1–4.
[30]   INGOLFSSON T M, HERSCHE M, WANG X, et al. EEG-TCNet: an accurate temporal convolutional network for embedded motor-imagery brain-machine interfaces [C]// IEEE International Conference on Systems, Man, and Cybernetics. Toronto: IEEE, 2020: 2958–2965.
[31]   WAN Z, LI M, LIU S, et al EEGformer: a transformer-based brain activity classification method using EEG signal[J]. Frontiers in Neuroscience, 2023, 17: 1148855
doi: 10.3389/fnins.2023.1148855