Journal of Zhejiang University (Engineering Science)  2026, Vol. 60 Issue (2): 388-395    DOI: 10.3785/j.issn.1008-973X.2026.02.017
Computer Technology and Control Engineering
Multi-dimensional evaluation of Chinese metaphors based on large language models
Xiaoxi HUANG, Zhengchao ZHA, Shijia LU
School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract:

The application of large language models (LLMs) to evaluating the quality of Chinese metaphorical sentences was explored. Drawing on prior work and insights from cognitive linguistics, a multi-dimensional evaluation standard for Chinese metaphors was formulated. A high-quality human-annotated dataset was constructed according to this standard to serve as a benchmark for validating LLM performance on Chinese metaphor evaluation. Guided by conceptual metaphor theory, an LLM-based multi-dimensional evaluation framework for Chinese metaphors was proposed, combining multi-turn dialogue with chain-of-thought prompting. The framework was tested on two distinct tasks: direct scoring of metaphor quality and within-group selection of the superior metaphor. The results show high consistency between LLM evaluations and human judgments: the Pearson correlation coefficient with human scores reached 0.807 on the direct scoring task, and Cohen's kappa coefficient with human scores reached 0.831 on the within-group selection task. By integrating conceptual metaphor theory with LLMs, the proposed framework handles the Chinese metaphor evaluation task well.
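The five dimensions and the total score used throughout the tables below lend themselves to a simple record type. The following is an illustrative sketch only: the field names mirror the table headers, and the 0-10 per-dimension range is inferred from the reported values rather than stated on this page.

```python
from dataclasses import dataclass

@dataclass
class MetaphorScore:
    """One rating of a Chinese metaphorical sentence.

    Field names mirror the dimensions in Table 1; the 0-10 range
    per dimension is an inference from the reported values.
    """
    sentence: str
    novelty: float        # 创新性
    relevance: float      # 关联性
    clarity: float        # 明确性
    literariness: float   # 文学性
    cultural_fit: float   # 文化契合度

    @property
    def total(self) -> float:
        """Total score; the published tables are consistent with a plain sum."""
        return (self.novelty + self.relevance + self.clarity
                + self.literariness + self.cultural_fit)
```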

Key words: metaphor evaluation; Chinese metaphor; conceptual metaphor theory; large language model (LLM); prompt engineering; multi-turn dialogue; chain-of-thought
Received: 2025-03-05    Published: 2026-02-03
CLC:  TP 391  
Funding: Humanities and Social Sciences Research Planning Fund of the Ministry of Education (18YJA740016).
About the author: HUANG Xiaoxi (1979-), male, associate professor, Ph.D., engaged in natural language processing research. orcid.org/0000-0003-4483-3664. E-mail: huangxx@hdu.edu.cn

Cite this article:


Xiaoxi HUANG, Zhengchao ZHA, Shijia LU. Multi-dimensional evaluation of Chinese metaphors based on large language models. Journal of Zhejiang University (Engineering Science), 2026, 60(2): 388-395.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2026.02.017        https://www.zjujournals.com/eng/CN/Y2026/V60/I2/388

Fig. 1  Flowchart of the prompting framework
Fig. 2  Example of staged progressive prompts
Fig. 3  Example output of LLM metaphor evaluation results
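A rough sketch of what the staged, chain-of-thought prompting of Figs. 1 and 2 might look like in code. The three-stage split, the stage wording (an English paraphrase, not the authors' original Chinese prompts), and the `client.chat(messages)` interface are all assumptions for illustration, not the paper's released implementation.

```python
import json

# Staged progressive prompting for metaphor evaluation (illustrative).
# The stages loosely follow conceptual metaphor theory: identify the
# domains, reason about the cross-domain mapping, then score.
STAGES = [
    "Identify the tenor (target domain) and the vehicle (source "
    "domain) of this Chinese metaphor: {sentence}",
    "Explain step by step how properties of the source domain map "
    "onto the target domain, and whether the mapping is apt.",
    "Based on your analysis, rate the metaphor from 0 to 10 on "
    "novelty, relevance, clarity, literariness, and cultural fit. "
    'Reply as JSON, e.g. {{"novelty": 6.5, ...}}.',
]

def evaluate_metaphor(client, sentence: str) -> dict:
    """Run the multi-turn dialogue and parse the final JSON scores.

    `client` is any chat wrapper exposing `client.chat(messages) -> str`
    (a placeholder interface, not a real library's API).
    """
    messages = []
    reply = ""
    for stage in STAGES:
        messages.append({"role": "user",
                         "content": stage.format(sentence=sentence)})
        reply = client.chat(messages)  # model answer for this turn
        messages.append({"role": "assistant", "content": reply})
    return json.loads(reply)  # scores come from the last turn
```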
Metaphorical sentence | Novelty | Relevance | Clarity | Literariness | Cultural fit | Total
人潮卷来卷去,地坝变成了露天舞台. (The crowd surged back and forth, and the courtyard became an open-air stage.) | 4.0 | 8.0 | 8.0 | 8.0 | 7.0 | 35.0
字迹写得东扭西歪,像被狂风吹过的小草. (The handwriting was twisted and crooked, like grass blown over by a gale.) | 6.5 | 7.0 | 8.5 | 7.5 | 8.0 | 37.5
这座城市拥抱着所有来到这里的人,像个温暖的母亲. (This city embraces everyone who comes here, like a warm mother.) | 5.0 | 8.0 | 8.0 | 7.0 | 6.0 | 34.0
Table 1  Examples of metaphor evaluation results
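Continuing the MetaphorScore sketch above, the totals in Table 1 are consistent with a plain sum of the five dimension scores:

```python
row = MetaphorScore(
    sentence="人潮卷来卷去,地坝变成了露天舞台.",
    novelty=4.0, relevance=8.0, clarity=8.0,
    literariness=8.0, cultural_fit=7.0,
)
assert row.total == 35.0  # matches the total reported in Table 1
```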
Fig. 4  Flowchart of the collaborative arbitration evaluation framework
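Fig. 4 is described here only at flowchart level. One plausible reading of a collaborative arbitration step — several LLM raters score independently and an arbiter is consulted when they disagree — can be sketched as follows; the spread measure and the threshold are illustrative guesses, not details taken from the paper.

```python
from statistics import mean, pstdev

def arbitrate(model_scores: list[float], arbiter_score: float,
              threshold: float = 1.5) -> float:
    """Combine per-model total scores for one metaphor.

    If the raters broadly agree (small spread), return their mean;
    otherwise defer to a designated arbiter model. Both the spread
    measure and the 1.5-point threshold are assumptions.
    """
    if pstdev(model_scores) <= threshold:
        return mean(model_scores)
    return arbiter_score

# Example: the four model scores disagree, so the arbiter decides.
print(arbitrate([42.0, 34.55, 39.4, 38.55], arbiter_score=36.0))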
Model | Novelty | Relevance | Clarity | Literariness | Cultural fit | Total
GPT-4 | 7.350 | 8.700 | 8.900 | 8.450 | 8.600 | 42.000
ERNIE-4.0 | 5.150 | 7.200 | 7.800 | 6.750 | 7.700 | 34.550
Qwen2.5 | 6.800 | 8.100 | 8.300 | 8.050 | 8.150 | 39.400
GLM-4-Plus | 6.400 | 8.150 | 8.150 | 7.900 | 7.950 | 38.550
Human baseline | 4.750 | 7.705 | 7.575 | 6.875 | 7.575 | 33.525
Table 2  Scores of different models on Chinese metaphorical sentences
Rater A | Rater B | r | p
Qwen2.5 | GLM-4-Plus | 0.891 | 0.0087
ERNIE-4.0 | GPT-4 | 0.878 | 0.0082
ERNIE-4.0 | Qwen2.5 | 0.913 | 0.0073
GPT-4 | Human baseline | 0.766 | 0.0119
ERNIE-4.0 | Human baseline | 0.807 | 0.0101
Qwen2.5 | Human baseline | 0.783 | 0.0163
GLM-4-Plus | Human baseline | 0.778 | 0.0145
Table 3  Pearson correlation coefficients between different raters
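The two agreement statistics reported in the abstract and in Table 3 are standard measures: Pearson's r for the direct scoring task and Cohen's kappa for the within-group selection task. A minimal computation with SciPy and scikit-learn, on made-up placeholder data:

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Direct scoring: correlate a model's scores with the human baseline.
model_scores = [35.0, 37.5, 34.0, 39.0]   # placeholder values
human_scores = [33.0, 38.0, 35.0, 38.5]   # placeholder values
r, p = pearsonr(model_scores, human_scores)

# Within-group selection: index of the preferred metaphor per group.
model_picks = [0, 1, 1, 2, 0]             # placeholder values
human_picks = [0, 1, 0, 2, 0]             # placeholder values
kappa = cohen_kappa_score(model_picks, human_picks)

print(f"Pearson r = {r:.3f} (p = {p:.4f}); Cohen's kappa = {kappa:.3f}")
```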