Journal of Zhejiang University (Engineering Science)  2024, Vol. 58, Issue (2): 317-324    DOI: 10.3785/j.issn.1008-973X.2024.02.010
Computer Technology and Communication Technology
Multimodal cascaded document layout analysis network based on Transformer
Shaojie WEN 1,2, Ruigang WU 1,2, Chaowen FENG 1,2, Yingli LIU 1,2,*
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China
Full text: PDF (2183 KB)   HTML
Abstract:

The multimodal cascaded document layout analysis network (MCOD-Net) based on Transformer was proposed to address two problems in existing methods: the misaligned embedding of pretraining objectives across the text and image modalities, and the complex preprocessing of document images with convolutional neural network (CNN) structures, which leads to large numbers of model parameters. A word-block alignment embedding module (WAEM) was designed to align the embeddings of the pretraining objectives for the text and image modalities, and masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA) were used for pretraining to strengthen the model's representation learning on both modalities. The original document images were used directly and represented by linear projections of image patches, which simplified the model structure and reduced the number of model parameters. Experimental results show that the proposed model achieves a mean average precision (mAP) of 95.1% on the publicly available PubLayNet dataset, an overall improvement of 2.5 percentage points over other models, with outstanding generalization ability and the best comprehensive performance.
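For intuition, the patch-embedding step described above can be sketched in a few lines of PyTorch: the document image is split into non-overlapping patches and each flattened patch is linearly projected, replacing a CNN preprocessing backbone. This is a hedged illustration only, not the paper's implementation; the patch size (16), channel count (3) and hidden dimension (768) are assumptions, since this page does not specify the model's configuration.

```python
# Minimal sketch of a linear patch embedding replacing a CNN backbone.
# Patch size 16, 3 channels and hidden size 768 are assumptions.
import torch
import torch.nn as nn


class LinearPatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 16, in_channels: int = 3, hidden_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        # A single linear map per flattened patch is the only learned component.
        self.proj = nn.Linear(patch_size * patch_size * in_channels, hidden_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, channels, height, width); height and width are
        # assumed to be divisible by patch_size.
        b, c, h, w = images.shape
        p = self.patch_size
        # Cut the image into (h/p) x (w/p) patches, then flatten each patch.
        patches = images.unfold(2, p, p).unfold(3, p, p)   # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(patches)                          # (b, num_patches, hidden_dim)


if __name__ == "__main__":
    embed = LinearPatchEmbedding()
    tokens = embed(torch.randn(1, 3, 224, 224))
    print(tokens.shape)  # torch.Size([1, 196, 768])
```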

Key words: document layout analysis    word-block alignment embedding    Transformer    MCOD-Net model
Received: 2023-05-26    Published: 2024-01-23
CLC:  TP 391  
Supported by the National Natural Science Foundation of China (52061020, 61971208); the Open Fund of the Yunnan Key Laboratory of Computer Technologies Application (2020103); the Major Science and Technology Project of Yunnan Province (202302AG050009)
Corresponding author: Yingli LIU     E-mail: wenshaojie@stu.kust.edu.cn; lyl@kust.edu.cn
About the first author: Shaojie WEN (1999—), male, master's student, researching intelligent document information extraction. ORCID: 0009-0004-1100-2092. E-mail: wenshaojie@stu.kust.edu.cn
Cite this article:

Shaojie WEN, Ruigang WU, Chaowen FENG, Yingli LIU. Multimodal cascaded document layout analysis network based on Transformer. Journal of Zhejiang University (Engineering Science), 2024, 58(2): 317-324.

Link to this article:

https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.02.010        https://www.zjujournals.com/eng/CN/Y2024/V58/I2/317

Fig. 1  Architecture of the MCOD-Net model
Model                        Backbone   Np/10^6
ResNet-50                    CNN        25
ResNet-101                   CNN        44
ResNet-152                   CNN        60
Linear embedding (proposed)  Linear     0.6
Table 1  Comparison of parameter counts of different image processing methods
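For context, the 0.6 figure for the linear embedding is consistent with a single projection matrix. Assuming 16×16 RGB patches mapped to a 768-dimensional hidden space (an assumption; the configuration is not stated on this page), the parameter count is Np = (16 × 16 × 3) × 768 + 768 = 590 592 ≈ 0.6 × 10^6, roughly 40 times fewer parameters than the ResNet-50 backbone alone.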
Fig. 2  PubLayNet dataset and corresponding labels
Model           Backbone       mAP/%
PubLayNet[24]   Mask R-CNN     91.0
DiT[13]         Mask R-CNN     91.6
DiT[13]         Cascade R-CNN  92.5
UDoc[25]        Faster R-CNN   91.7
BEiT[23]        Mask R-CNN     92.6
MCOD-Net        Cascade R-CNN  95.1
Table 2  Overall performance of the proposed and existing models on the PubLayNet dataset
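Note that the 2.5-percentage-point overall improvement quoted in the abstract appears to correspond to the gap between MCOD-Net and the strongest baseline in Table 2, BEiT[23] with Mask R-CNN: 95.1% − 92.6% = 2.5 percentage points.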
Model                      mAP/%
                           Text   Title  List   Table  Figure
PubLayNet[24]              91.6   84.0   88.6   96.0   94.9
DiT[13] (Mask R-CNN)       92.8   84.5   86.8   97.5   96.5
DiT[13] (Cascade R-CNN)    93.6   85.9   89.7   97.6   96.9
UDoc[25]                   92.6   86.5   92.4   96.5   95.4
BEiT[23]                   92.5   86.2   93.1   97.3   95.7
MCOD-Net                   94.4   90.5   95.4   97.8   97.0
Table 3  mAP of the proposed and existing models for each element category on the PubLayNet dataset
Fig. 3  Recognition results for each element category
Fig. 4  Visualization of the MCOD-Net model on the PubLayNet dataset
Fig. 5  Visualization of the MCOD-Net model on aluminum-silicon alloy literature
1 SOTO C, YOO S. Visual detection with context for document layout analysis [C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong: ACL, 2019: 3464-3470.
2 WATANABE T, LUO Q, SUGIE N. Structure recognition methods for various types of documents [J]. Machine Vision and Applications, 1993, 6(2/3): 163-176.
3 HIRAYAMA Y. A method for table structure analysis using DP matching[C]//Proceedings of 3rd International Conference on Document Analysis and Recognition. Montreal: IEEE, 1995: 583-586.
4 FANG J, GAO L, BAI K, et al. A table detection method for multipage PDF documents via visual separators and tabular structures [C]//2011 International Conference on Document Analysis and Recognition. Beijing: IEEE, 2011: 779-783.
5 BUNKE H, RIESEN K. Recent advances in graph-based pattern recognition with applications in document analysis [J]. Pattern Recognition, 2011, 44(5): 1057-1067.
doi: 10.1016/j.patcog.2010.11.015
6 HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks [J]. Science, 2006, 313(5786): 504-507.
doi: 10.1126/science.1127647
7 ZHANG Zhen, LI Ning, TIAN Yingai. Stream document structure recognition based on bidirectional LSTM network [J]. Computer Engineering, 2020, 46(1): 60-66.
doi: 10.19678/j.issn.1000-3428.0053702
8 SAHA R, MONDAL A, JAWAHAR C V. Graphical object detection in document images [C]//International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 51-58.
9 GIRSHICK R. Fast R-CNN [C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1440-1448.
10 HE K, GKIOXARI G, DOLLÁR P, et al. Mask R-CNN [C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
11 RIBA P, DUTTA A, GOLDMANN L, et al. Table detection in invoice documents by graph neural networks [C]//2019 International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 122-127.
12 XU Y, LI M, CUI L, et al. LayoutLM: pre-training of text and layout for document image understanding [C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego: ACM, 2020: 1192-1200.
13 LI J, XU Y, LV T, et al. DiT: self-supervised pre-training for document image transformer [C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisbon: ACM, 2022: 3530-3539.
14 APPALARAJU S, JASANI B, KOTA B U, et al. DocFormer: end-to-end transformer for document understanding [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 993-1003.
15 PHAM V, PHAM C, DANG T. Road damage detection and classification with Detectron2 and Faster R-CNN [C]//2020 IEEE International Conference on Big Data. Atlanta: IEEE, 2020: 5592-5601.
16 CAI Z, VASCONCELOS N. Cascade R-CNN: delving into high quality object detection [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6154-6162.
17 RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 12179-12188.
18 KIM W, SON B, KIM I. ViLT: vision-and-language transformer without convolution or region supervision [C]//Proceedings of the 38th International Conference on Machine Learning. [S. l.]: PMLR, 2021: 5583-5594.
19 GHIASI G, LIN T Y, LE Q V. NAS-FPN: learning scalable feature pyramid architecture for object detection [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE, 2019: 7036-7045.
20 KAWINTIRANON K, SINGH L. Knowledge enhanced masked language model for stance detection [C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Mexico: ACL, 2021: 4725-4735.
21 XIE Z, ZHANG Z, CAO Y, et al. SimMIM: a simple framework for masked image modeling [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 9653-9663.
22 HUANG Y, LV T, CUI L, et al. LayoutLMv3: pre-training for document AI with unified text and image masking [C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisbon: ACM, 2022: 4083-4091.
23 BAO H, DONG L, PIAO S, et al. BEiT: BERT pre-training of image transformers [EB/OL]. [2022-09-03]. https://arxiv.org/abs/2106.08254.
24 ZHONG X, TANG J, YEPES A J. PubLayNet: largest dataset ever for document layout analysis [C]//2019 International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 1015-1022.