Journal of Zhejiang University (Engineering Science), 2026, Vol. 60, Issue 7: 1381-1391    DOI: 10.3785/j.issn.1008-973X.2026.07.002
Computer and Control Engineering
Lightweight micro-expression recognition based on optical flow and convolutional vision Transformer
Kaiwei XU (1, 2), Hafiz KHIZER BIN TALIB (1, 2), Yanlong CAO (1, 2, *), Yuanping XU (3), Zhijie XU (4), Jingchun SONG (5)
1. School of Mechanical Engineering, Zhejiang University, Hangzhou 310058, China
2. State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University, Hangzhou 310058, China
3. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
4. School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou 215123, China
5. Department of Critical Care Medicine, 908th Hospital of Joint Logistic Support Force of Chinese PLA, Nanchang 330002, China
Abstract:

A lightweight micro-expression recognition method based on optical flow and a convolutional vision Transformer was proposed to address the short duration, low motion intensity and limited sample size of micro-expressions. The optical flow and optical strain of the face between the onset frame and the apex frame were extracted to highlight facial muscle movements, which effectively reduced texture interference and lowered the feature dimension. An adversarial domain adaptation method based on the identity domain was introduced; it makes full use of subject labels to remove irrelevant components from the micro-expression features. A lightweight multi-stage CNN-Transformer hybrid model, MiER-CvT, consisting of a convolutional embedding layer, convolutional Transformer blocks and a SeqSoftmax layer, was constructed to enhance the model's local representation and information integration capabilities for micro-expressions. Experimental results showed that the proposed method achieved an unweighted F1 score (UF1) of 0.9171 and an unweighted average recall (UAR) of 0.9192 on the MEGC 2019 dataset, while MiER-CvT had 7.5 M parameters and a computational cost of 0.1 GFLOPs. Compared with existing methods such as MiMaNet, the proposed method combines high accuracy with a lightweight design.

Key words: micro-expression recognition; optical flow estimation; convolutional vision Transformer; attention mechanism; domain adaptation
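The abstract states that optical flow and optical strain between the onset and apex frames form the model input, but the strain formula is not reproduced in this excerpt. The NumPy sketch below assumes the flow field (u, v) has already been computed (reference [34] suggests Farneback's method, e.g. via OpenCV) and uses the common strain-magnitude definition associated with Shreve et al. [33]. The three-channel stacking (u, v, strain) and the function names are illustrative assumptions, not the authors' code.

```python
import numpy as np


def optical_strain(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Optical strain magnitude of a dense flow field.

    u, v: horizontal and vertical flow components, each of shape (H, W).
    Spatial derivatives use np.gradient (central differences inside,
    one-sided differences at the border).
    """
    # np.gradient returns the derivative along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                    # normal strain along x
    e_yy = dv_dy                    # normal strain along y
    e_xy = 0.5 * (du_dy + dv_dx)    # shear strain; the tensor is symmetric, so e_yx == e_xy
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)


def flow_strain_input(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Hypothetical input builder: stack u, v and strain into a (3, H, W) array."""
    return np.stack([u, v, optical_strain(u, v)], axis=0)
```

A rigid rotation (u = -y, v = x) or translation yields zero strain, whereas a uniform stretch u = 0.1x yields a strain of 0.1 everywhere. This is why strain emphasizes local deformation, such as muscle contraction, over rigid head motion.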
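Received: 2025-03-30    Published: 2026-05-23
CLC number: TP 391.4
Fund: Supported by the Qingdao Key Technology Research and Industrialization Demonstration Project (23-7-2-qljh-2-gx) and the Suzhou Industrial Park Science and Education Leading Talent Project (KJL2024104).
Corresponding author: Yanlong CAO    E-mail: kaiweixu@zju.edu.cn; sdcaoyl@zju.edu.cn
About the first author: Kaiwei XU (born 2000), male, master's student, engaged in research on deep learning and computer vision. orcid.org/0009-0002-1857-5710. E-mail: kaiweixu@zju.edu.cn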

Cite this article:

Kaiwei XU, Hafiz KHIZER BIN TALIB, Yanlong CAO, Yuanping XU, Zhijie XU, Jingchun SONG. Lightweight micro-expression recognition based on optical flow and convolutional vision Transformer. Journal of Zhejiang University (Engineering Science), 2026, 60(7): 1381-1391.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.07.002
https://www.zjujournals.com/eng/CN/Y2026/V60/I7/1381

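Fig. 1  Overall framework of the proposed method
Fig. 2  Optical flow features of micro-expressions from different subjects
Fig. 3  Adversarial domain adaptation method based on the identity domain
Fig. 4  Comparison of the convolutional Transformer block and the ViT Transformer structure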
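Fig. 3 and the abstract describe the identity-domain adaptation only at a high level. A standard way to implement adversarial domain adaptation is the gradient reversal layer (GRL) of Ganin and Lempitsky [35]: an auxiliary head predicts the subject identity from the shared features, and its gradient is multiplied by -λ before reaching the feature extractor, pushing the extractor toward features that carry little subject identity. The NumPy sketch below illustrates that mechanism under stated assumptions: linear softmax heads, the DANN λ schedule, and hypothetical function names. The paper's actual network, loss weights and schedule are not given in this excerpt.

```python
import numpy as np


def lambda_schedule(p: float, gamma: float = 10.0) -> float:
    """DANN adaptation-weight schedule: rises from 0 to ~1 as training progress p goes 0 -> 1."""
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)  # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def ce_grad_wrt_features(f: np.ndarray, W: np.ndarray, y: np.ndarray):
    """Mean cross-entropy of a linear softmax head (logits = f @ W) and its gradient w.r.t. f.

    f: (N, D) shared features, W: (D, K) head weights, y: (N,) integer labels.
    """
    n = f.shape[0]
    p = softmax(f @ W)
    loss = -np.log(p[np.arange(n), y]).mean()
    d_logits = p.copy()
    d_logits[np.arange(n), y] -= 1.0   # dL/dlogits = (softmax - one_hot) / N
    d_logits /= n
    return loss, d_logits @ W.T


def feature_gradient(f, W_cls, y_cls, W_id, y_id, lam):
    """Gradient that reaches the shared feature extractor.

    The expression head contributes its ordinary gradient; the identity
    (subject) head's gradient passes through a GRL, which is the identity in
    the forward pass and multiplies by -lam in the backward pass. The identity
    head's own weights would still be trained on the unreversed gradient (not shown).
    """
    _, g_cls = ce_grad_wrt_features(f, W_cls, y_cls)
    _, g_id = ce_grad_wrt_features(f, W_id, y_id)
    return g_cls - lam * g_id
```

Because the reversal only flips the sign of one gradient term, the identity head still learns to recognize subjects while the shared features are updated to defeat it. λ ramps up slowly so that the early, noisy identity gradients do not dominate training.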
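Table 1  Sample distribution of the MEGC 2019 dataset

| Expression class | Full | SMIC | CASME II | SAMM |
|---|---|---|---|---|
| Negative | 250 | 70 | 88 | 92 |
| Positive | 109 | 51 | 32 | 26 |
| Surprise | 83 | 43 | 25 | 15 |
| Total | 442 | 164 | 145 | 133 |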
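Table 2  Performance comparison between the proposed method and existing methods

| Method | Full UF1 | Full UAR | SMIC UF1 | SMIC UAR | CASME II UF1 | CASME II UAR | SAMM UF1 | SAMM UAR |
|---|---|---|---|---|---|---|---|---|
| LBP-TOP[3] | 0.5882 | 0.5785 | 0.2000 | 0.5280 | 0.7026 | 0.7429 | 0.3954 | 0.4102 |
| Bi-WOOF[7] | 0.6296 | 0.6227 | 0.5727 | 0.5829 | 0.7805 | 0.8026 | 0.5211 | 0.5139 |
| STSTNet[1] | 0.7353 | 0.7605 | 0.6801 | 0.7013 | 0.8382 | 0.8686 | 0.6588 | 0.6810 |
| IncepTR[23] | 0.7530 | 0.7460 | 0.6550 | 0.6500 | 0.9110 | 0.8960 | 0.6910 | 0.6940 |
| EDSMISEViTNet[27] | 0.7587 | 0.7736 | 0.7372 | 0.7139 | 0.8521 | 0.8461 | 0.7216 | 0.6781 |
| SLSTT-LSTM[18] | 0.8160 | 0.7900 | 0.7400 | 0.7200 | 0.9010 | 0.8850 | 0.7150 | 0.6430 |
| BDCNN[15] | 0.8509 | 0.8500 | 0.7859 | 0.7869 | 0.9501 | 0.9516 | 0.8186 | 0.7994 |
| ViT-16/B[21] | 0.8512 | 0.8397 | 0.8044 | 0.7994 | 0.9220 | 0.9137 | 0.8142 | 0.7847 |
| HTNet[22] | 0.8603 | 0.8475 | 0.8049 | 0.7905 | 0.9532 | 0.9516 | 0.8131 | 0.8124 |
| MTMNet[10] | 0.8640 | 0.8570 | 0.8640 | 0.8610 | 0.8700 | 0.8720 | 0.8250 | 0.8190 |
| ResNet-18[12] | 0.8696 | 0.8724 | 0.7918 | 0.7982 | 0.9594 | 0.9612 | 0.8820 | 0.8549 |
| MiMaNet[11] | 0.8830 | 0.8760 | 0.8730 | 0.8670 | 0.8810 | 0.8810 | 0.8960 | 0.8840 |
| Micron-BERT[29] | 0.8903 | 0.8842 | - | - | - | - | - | - |
| MiER-CvT | 0.9102 | 0.9102 | 0.8512 | 0.8547 | 0.9858 | 0.9858 | 0.9080 | 0.8936 |
| MiER-CvT + identity domain | 0.9171 | 0.9192 | 0.8546 | 0.8601 | 0.9928 | 0.9896 | 0.9177 | 0.9064 |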
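Tables 2, 5 and 6 report UF1 (unweighted F1) and UAR (unweighted average recall) over the three classes of Table 1. This excerpt does not restate the evaluation protocol; the sketch below assumes the usual MEGC 2019 composite setting, in which predictions from all leave-one-subject-out (LOSO) folds are pooled before per-class counts are taken. The function name `uf1_uar` is hypothetical.

```python
from collections import Counter


def uf1_uar(y_true, y_pred, classes):
    """Unweighted F1 (UF1) and unweighted average recall (UAR).

    Per-class TP/FP/FN are counted over all samples passed in, e.g. the
    predictions pooled from every LOSO fold. Each class gets equal weight.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p was wrong
            fn[t] += 1   # true class t was missed
    f1s, recalls = [], []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]          # F1 = 2TP / (2TP + FP + FN)
        f1s.append(2 * tp[c] / denom if denom else 0.0)
        n_c = tp[c] + fn[c]                         # number of true samples of class c
        recalls.append(tp[c] / n_c if n_c else 0.0)
    return sum(f1s) / len(classes), sum(recalls) / len(classes)
```

Because both metrics weight every class equally, the minority "surprise" class (83 of 442 samples in Table 1) counts as much as "negative" (250). This is one reason these metrics are preferred over plain accuracy on an imbalanced dataset.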
Fig. 5  Performance comparison between the proposed method and existing methods on the MEGC 2019 dataset
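Table 3  Comparison of model complexity and performance

| Model | Parameters Np/M | FLOPs/G | Full UF1 |
|---|---|---|---|
| ViT-16/B[21] | 86.6 | 17.6 | 0.8512 |
| ResNet-18[12] | 11.7 | 1.8 | 0.8696 |
| EDSMISEViTNet[27] | 3.9 | - | 0.7587 |
| MiER-CvT | 7.5 | 0.1 | 0.9102 |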
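Table 4  Results of comparative experiments on different network structures

| No. | Model type | $K_i^{\mathrm{e}}$ | $S_i^{\mathrm{e}}$ | $P_i^{\mathrm{e}}$ | $C_i$ | Attention heads | Transformer layers | Full UF1 | Full UAR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Baseline three-stage model (MiER-CvT) | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 384 | 1, 3, 6 | 1, 2, 3 | 0.9071 | 0.9062 |
| 2 | Three-stage, different numbers of layers | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 384 | 1, 3, 6 | 1, 2, 4 | 0.9051 | 0.9066 |
| 3 | Three-stage, different numbers of layers | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 384 | 1, 3, 6 | 1, 2, 5 | 0.9069 | 0.9058 |
| 4 | Three-stage, different numbers of layers | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 384 | 1, 3, 6 | 1, 3, 3 | 0.9026 | 0.9070 |
| 5 | Three-stage, different numbers of layers | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 384 | 1, 3, 6 | 1, 3, 4 | 0.9060 | 0.9018 |
| 6 | Three-stage, different numbers of layers | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 384 | 1, 3, 6 | 2, 2, 3 | 0.8990 | 0.9005 |
| 7 | Three-stage, different embedding dimensions | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 192, 768 | 1, 3, 12 | 1, 2, 3 | 0.9033 | 0.9049 |
| 8 | Three-stage, different embedding dimensions | 32, 5, 3 | 16, 3, 1 | 8, 0, 0 | 64, 384, 768 | 1, 6, 12 | 1, 2, 3 | 0.9074 | 0.9093 |
| 9 | Three-stage, conventional convolution | 7, 3, 3 | 4, 2, 2 | 2, 1, 1 | 64, 192, 384 | 1, 3, 6 | 1, 2, 3 | 0.8443 | 0.8470 |
| 10 | Two-stage model | 32, 5 | 16, 3 | 8, 0 | 64, 192 | 1, 3 | 1, 2 | 0.8927 | 0.8940 |
| 11 | Single-stage model | 32 | 16 | 8 | 64 | 1 | 1 | 0.7682 | 0.7526 |

Note: $K_i^{\mathrm{e}}$, $S_i^{\mathrm{e}}$, $P_i^{\mathrm{e}}$ and $C_i$ are the kernel size, stride, padding and embedding dimension of the convolutional embedding layer in each stage i.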
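Table 4 lists kernel size, stride and padding for each stage's convolutional embedding. The standard convolution output-size formula shows how these settings shrink the token grid. The input resolution is not stated in this excerpt, so the 224 × 224 input below is a hypothetical assumption used only for illustration. The function names are likewise hypothetical.

```python
def conv_out(size: int, k: int, s: int, p: int) -> int:
    """Spatial output size of a convolution with kernel k, stride s, padding p."""
    return (size + 2 * p - k) // s + 1


def stage_token_grid(input_size, kernels, strides, paddings):
    """Side length of the (square) feature map after each stage's convolutional embedding."""
    sides = []
    size = input_size
    for k, s, p in zip(kernels, strides, paddings):
        size = conv_out(size, k, s, p)
        sides.append(size)
    return sides


# Stage settings copied from Table 4; the input size 224 is an assumption.
baseline = stage_token_grid(224, (32, 5, 3), (16, 3, 1), (8, 0, 0))      # -> [14, 4, 2]
conventional = stage_token_grid(224, (7, 3, 3), (4, 2, 2), (2, 1, 1))  # -> [56, 28, 14]

# Tokens per stage (side * side).
baseline_tokens = [s * s for s in baseline]          # -> [196, 16, 4]
conventional_tokens = [s * s for s in conventional]  # -> [3136, 784, 196]
```

Under this assumed input, the large-kernel, large-stride embedding of the baseline produces far fewer tokens per stage than the conventional setting of row 9. Since self-attention cost grows roughly with the square of the token count, this is one plausible contributor to the low 0.1 GFLOPs figure in Table 3.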
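Table 5  Results of comparative experiments with different model inputs

| Model input | Frames | Full UF1 | Full UAR | SMIC UF1 | SMIC UAR | CASME II UF1 | CASME II UAR | SAMM UF1 | SAMM UAR |
|---|---|---|---|---|---|---|---|---|---|
| RGB image | Apex frame | 0.7619 | 0.7581 | 0.6487 | 0.6470 | 0.9034 | 0.8936 | 0.7435 | 0.7274 |
| Optical flow | Apex frame | 0.8870 | 0.8854 | 0.8207 | 0.8227 | 0.9651 | 0.9620 | 0.8889 | 0.8735 |
| Optical flow + optical strain | 4 uniformly sampled frames | 0.8411 | 0.8317 | 0.8036 | 0.8047 | 0.8324 | 0.8044 | 0.8793 | 0.8716 |
| Optical flow + optical strain | 2 uniformly sampled frames | 0.8537 | 0.8575 | 0.8252 | 0.8333 | 0.8457 | 0.8500 | 0.8906 | 0.8678 |
| Optical flow + optical strain | Apex frame | 0.9102 | 0.9102 | 0.8512 | 0.8547 | 0.9858 | 0.9858 | 0.9080 | 0.8936 |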
Fig. 6  Confusion matrices of the proposed method on different datasets
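Table 6  Ablation results for each module of the model

| No. | Attention projection | Classifier input | Full UF1 | Full UAR | SMIC UF1 | SMIC UAR | CASME II UF1 | CASME II UAR | SAMM UF1 | SAMM UAR |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Linear | class token | 0.8839 | 0.8829 | 0.8189 | 0.8208 | 0.9574 | 0.9574 | 0.8828 | 0.8678 |
| 2 | Conv_BN | class token | 0.9049 | 0.9053 | 0.8420 | 0.8482 | 0.9785 | 0.9754 | 0.9030 | 0.8748 |
| 3 | InnerBN_DWConv | class token | 0.9071 | 0.9062 | 0.8485 | 0.8529 | 0.9789 | 0.9820 | 0.9031 | 0.8714 |
| 4 | InnerBN_DWConv | SeqSoftmax-g | 0.9098 | 0.9098 | 0.8466 | 0.8517 | 0.9780 | 0.9688 | 0.9212 | 0.9158 |
| 5 | InnerBN_DWConv | SeqSoftmax-x | 0.9102 | 0.9102 | 0.8512 | 0.8547 | 0.9858 | 0.9858 | 0.9080 | 0.8936 |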
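Table 6 compares attention projections (Linear, Conv_BN, InnerBN_DWConv) without defining them in this excerpt. For orientation, the sketch below follows the convolutional projection of CvT [32]: tokens are reshaped to a 2-D map, passed through a depthwise convolution (one filter per channel, cf. [37]) and batch normalization [36], then flattened and linearly projected into queries, keys or values. It is not a reconstruction of the authors' InnerBN_DWConv, whose exact BN placement is unknown here. All function names are hypothetical, and BN is shown in inference mode with stored statistics.

```python
import numpy as np


def depthwise_conv2d(x: np.ndarray, kernels: np.ndarray, stride: int = 1) -> np.ndarray:
    """Depthwise convolution. x: (C, H, W); kernels: (C, k, k) with odd k; zero 'same'-style padding."""
    C, H, W = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    Ho = (H + 2 * pad - k) // stride + 1
    Wo = (W + 2 * pad - k) // stride + 1
    out = np.empty((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, i, j] = (patch * kernels).sum(axis=(1, 2))  # each channel uses only its own filter
    return out


def batch_norm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Per-channel batch normalization with stored statistics; x: (C, H, W)."""
    scale = gamma / np.sqrt(var + eps)
    return scale[:, None, None] * (x - mean[:, None, None]) + beta[:, None, None]


def conv_projection(tokens, h, w, kernels, bn, W_proj, stride=1):
    """CvT-style projection: (h*w, C) tokens -> 2-D map -> depthwise conv -> BN -> flatten -> linear.

    bn = (mean, var, gamma, beta); W_proj: (C, C_out). In CvT, keys and values
    typically use stride 2 to reduce the number of tokens attended over.
    """
    C = tokens.shape[1]
    x = tokens.T.reshape(C, h, w)            # token n = i*w + j -> x[:, i, j]
    x = depthwise_conv2d(x, kernels, stride)
    x = batch_norm_inference(x, *bn)
    flat = x.reshape(C, -1).T                # back to (N', C), same row-major order
    return flat @ W_proj
```

Compared with a plain linear projection (row 1 of Table 6), the depthwise convolution lets each query, key and value mix information from neighboring facial regions, which is the local inductive bias that convolutional Transformers are designed to add.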
Fig. 7  Attention visualization results of the proposed method

References:
1 LIONG S T, GAN Y, SEE J, et al. Shallow triple stream three-dimensional CNN (STSTNet) for micro-expression recognition [C]// Proceedings of the 14th IEEE International Conference on Automatic Face & Gesture Recognition. Lille: IEEE, 2019: 1–5.
2 PORTER S, TEN BRINKE L. Reading between the lies: identifying concealed and falsified emotions in universal facial expressions [J]. Psychological Science, 2008, 19(5): 508–514. doi: 10.1111/j.1467-9280.2008.02116.x
3 ZHAO G, PIETIKAINEN M. Dynamic texture recognition using local binary patterns with an application to facial expressions [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6): 915–928. doi: 10.1109/TPAMI.2007.1110
4 OJALA T, PIETIKAINEN M, MAENPAA T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(7): 971–987. doi: 10.1109/TPAMI.2002.1017623
5 CHAUDHRY R, RAVICHANDRAN A, HAGER G, et al. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Miami: IEEE, 2009: 1932–1939.
6 LIU Y, ZHANG J, YAN W, et al. A main directional mean optical flow feature for spontaneous micro-expression recognition [J]. IEEE Transactions on Affective Computing, 2016, 7(4): 299–310. doi: 10.1109/TAFFC.2015.2485205
7 LIONG S T, SEE J, WONG K, et al. Less is more: micro-expression recognition from video using apex frame [J]. Signal Processing: Image Communication, 2018, 62: 82–92. doi: 10.1016/j.image.2017.11.006
8 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach: Curran Associates Inc., 2017: 6000–6010.
9 WANG C, PENG M, BI T, et al. Micro-attention for micro-expression recognition [J]. Neurocomputing, 2020, 410: 354–362. doi: 10.1016/j.neucom.2020.06.005
10 XIA B, WANG W, WANG S, et al. Learning from macro-expression: a micro-expression recognition framework [C]// Proceedings of the 28th ACM International Conference on Multimedia. Seattle: ACM, 2020: 2936–2944.
11 XIA B, WANG S. Micro-expression recognition enhanced by macro-expression from spatial-temporal domain [C]// Proceedings of the 30th International Joint Conference on Artificial Intelligence. Montreal: IJCAI, 2021: 1186–1193.
12 HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016: 770–778.
13 GAN Y S, LIONG S T, YAU W, et al. OFF-ApexNet on micro-expression recognition system [J]. Signal Processing: Image Communication, 2019, 74: 129–139. doi: 10.1016/j.image.2019.02.005
14 KHOR H Q, SEE J, LIONG S T, et al. Dual-stream shallow networks for facial micro-expression recognition [C]// Proceedings of the IEEE International Conference on Image Processing. Taipei: IEEE, 2019: 36–40.
15 CHEN B, LIU K, XU Y, et al. Block division convolutional network with implicit deep features augmentation for micro-expression recognition [J]. IEEE Transactions on Multimedia, 2023, 25: 1345–1358. doi: 10.1109/TMM.2022.3141616
16 DOSOVITSKIY A, FISCHER P, ILG E, et al. FlowNet: learning optical flow with convolutional networks [C]// Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 2758–2766.
17 LIANG Yan, HUANG Runcai, LU Shicheng. Multimodal micro-expression recognition based on improved 3D ResNet18 [J]. Application Research of Computers, 2025, 42(3): 903–910. doi: 10.19734/j.issn.1001-3695.2024.04.0216 (in Chinese)
18 ZHANG L, HONG X, ARANDJELOVIĆ O, et al. Short and long range relation based spatio-temporal Transformer for micro-expression recognition [J]. IEEE Transactions on Affective Computing, 2022, 13(4): 1973–1985. doi: 10.1109/TAFFC.2022.3213509
19 HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735–1780. doi: 10.1162/neco.1997.9.8.1735
20 FAN Y, JIA M, ZHANG Y, et al. Micro-expression recognition using pre-trained model and Transformer [C]// Proceedings of the IEEE 4th International Conference on Civil Aviation Safety and Information Technology. Dali: IEEE, 2022: 1404–1408.
21 DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale [EB/OL]. (2021-06-03) [2025-08-05]. https://arxiv.org/abs/2010.11929.
22 WANG Z, ZHANG K, LUO W, et al. HTNet for micro-expression recognition [J]. Neurocomputing, 2024, 602: 128196. doi: 10.1016/j.neucom.2024.128196
23 ZHOU H, HUANG S, XU Y. IncepTR: micro-expression recognition integrating inception-CBAM and vision Transformer [J]. Multimedia Systems, 2023, 29(6): 3863–3876. doi: 10.1007/s00530-023-01164-0
24 WOO S, PARK J, LEE J Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the European Conference on Computer Vision. Munich: Springer, 2018: 3–19.
25 SZEGEDY C, LIU W, JIA Y, et al. Going deeper with convolutions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 1–9.
26 XUE F, WANG Q, GUO G. TransFER: learning relation-aware facial expression representations with Transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 3581–3590.
27 ZHANG Bo, WU Yufan. Microexpression recognition algorithm based on a two-branch lightweight network [J]. Laser & Optoelectronics Progress, 2024, 61(14): 1437001. (in Chinese)
28 SANDLER M, HOWARD A, ZHU M, et al. MobileNetV2: inverted residuals and linear bottlenecks [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 4510–4520.
29 NGUYEN X, DUONG C, LI X, et al. Micron-BERT: BERT-based facial micro-expression recognition [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Vancouver: IEEE, 2023: 1482–1492.
30 BAO H, DONG L, PIAO S, et al. BEiT: BERT pre-training of image Transformers [EB/OL]. (2022-09-03) [2025-08-05]. https://arxiv.org/abs/2106.08254.
31 MITCHELL T M. The need for biases in learning generalizations [R]. New Jersey: Rutgers University, 1980.
32 WU H, XIAO B, CODELLA N, et al. CvT: introducing convolutions to vision Transformers [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 22–31.
33 SHREVE M, GODAVARTHY S, GOLDGOF D, et al. Macro- and micro-expression spotting in long videos using spatio-temporal strain [C]// Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition. Santa Barbara: IEEE, 2011: 51–56.
34 AN Jingjing, LIU Gaoping, ZHU Jianing. Application of Farneback optical flow method in nowcasting [J]. Computer Engineering & Software, 2018, 39(10): 18–25. doi: 10.3969/j.issn.1003-6970.2018.10.005 (in Chinese)
35 GANIN Y, LEMPITSKY V. Unsupervised domain adaptation by backpropagation [C]// Proceedings of the International Conference on Machine Learning. Lille: JMLR, 2015: 1180–1189.
36 IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift [C]// Proceedings of the International Conference on Machine Learning. Lille: JMLR, 2015: 448–456.
37 SIFRE L, MALLAT S. Rigid-motion scattering for texture classification [EB/OL]. (2014-03-07) [2025-08-05]. https://arxiv.org/abs/1403.1687.
38 LI X, PFISTER T, HUANG X, et al. A spontaneous micro-expression database: inducement, collection and baseline [C]// Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition. Shanghai: IEEE, 2013: 1–6.
39 YAN W, LI X, WANG S, et al. CASME II: an improved spontaneous micro-expression database and the baseline evaluation [J]. PLoS One, 2014, 9(1): e86041. doi: 10.1371/journal.pone.0086041
40 DAVISON A K, LANSLEY C, COSTEN N, et al. SAMM: a spontaneous micro-facial movement dataset [J]. IEEE Transactions on Affective Computing, 2018, 9(1): 116–129. doi: 10.1109/TAFFC.2016.2573832
41 FU R, HU Q, DONG X, et al. Axiom-based Grad-CAM: towards accurate visualization and explanation of CNNs [C]// Proceedings of the British Machine Vision Conference. [S.l.]: BMVA, 2020: 146.