|
|
Multimodal cascaded document layout analysis network based on Transformer |
Shaojie WEN1,2( ),Ruigang WU1,2,Chaowen FENG1,2,Yingli LIU1,2,*( ) |
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China 2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China |
|
|
Abstract The multimodal cascaded document layout analysis network (MCOD-Net) based on Transformer was proposed in order to solve the issue of misalignment in the existing methods for pretraining objectives in both text and image modalities, which involve complex preprocessing of document images using convolutional neural network (CNN) structures leading to many model parameters. The word block alignment embedding module (WAEM) was introduced to achieve alignment embedding of the pretraining objectives for text and image modalities. Masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA) were utilized for pretraining in order to enhance the model’s representation learning capabilities across text and image modalities. The model structure was simplified and the number of model parameters was reduced by directly using the original document images and representing them using linear projected features of image blocks. The experimental results demonstrate that the proposed model achieves an mean average precision (mAP) of 95.1% on the publicly available PubLayNet dataset. A 2.5% overall performance improvement was achieved with outstanding generalization ability and exhibiting the best comprehensive performance compared with other models.
|
Received: 26 May 2023
Published: 23 January 2024
|
|
Fund: 国家自然科学基金资助项目(52061020,61971208); 云南计算机技术应用重点实验室开放基金资助项目(2020103); 云南省重大科技专项资助项目(202302AG050009) |
Corresponding Authors:
Yingli LIU
E-mail: wenshaojie@stu.kust.edu.cn;lyl@kust.edu.cn
|
基于Transformer的多模态级联文档布局分析网络
针对现有方法在文本和图像模态的预训练目标上存在嵌入不对齐, 文档图像采用基于卷积神经网络(CNN)的结构进行预处理,流程复杂,模型参数量大的问题,提出基于Transformer的多模态级联文档布局分析网络(MCOD-Net). 设计词块对齐嵌入模块(WAEM),实现文本和图像模态预训练目标的对齐嵌入,使用掩码语言建模(MLM)、掩码图像建模(MIM)和词块对齐(WPA)进行预训练,以促进模型在文本和图像模态上的表征学习能力. 直接使用文档原始图像,用图像块的线性投影特征来表示文档图像,简化模型结构,减小了模型参数量. 实验结果表明,所提模型在PubLayNet公开数据集上的平均精度均值(mAP)达到95.1%. 相较于其他模型,整体性能提升了2.5%,泛化能力突出,综合效果最优.
关键词:
文档布局分析,
词块对齐嵌入,
Transformer,
MCOD-Net模型
|
|
[1] |
SOTO C, YOO S. Visual detection with context for document layout analysis [C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong: ACL, 2019: 3464-3470.
|
|
|
[2] |
WATANABE T, LUO Q, SUGIE N Structure recognition methods for various types of documents[J]. Machine Vision and Applications, 1993, 6 (2/3): 163- 176
|
|
|
[3] |
HIRAYAMA Y. A method for table structure analysis using DP matching[C]//Proceedings of 3rd International Conference on Document Analysis and Recognition. Montreal: IEEE, 1995: 583-586.
|
|
|
[4] |
FANG J, GAO L, BAI K, et al. A table detection method for multipage pdf documents via visual seperators and tabular structures [C]//2011 International Conference on Document Analysis and Recognition. Beijing: IEEE, 2011: 779-783.
|
|
|
[5] |
BUNKE H, RIESEN K Recent advances in graph-based pattern recognition with applications in document analysis[J]. Pattern Recognition, 2011, 44 (5): 1057- 1067
doi: 10.1016/j.patcog.2010.11.015
|
|
|
[6] |
HINTON G E, SALAKHUTDINOV R R Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313 (5786): 504- 507
doi: 10.1126/science.1127647
|
|
|
[7] |
张真, 李宁, 田英爱 基于双向LSTM网络的流式文档结构识别[J]. 计算机工程, 2020, 46 (1): 60- 66 ZHANG Zhen, LI Ning, TIAN Yingai Stream document structure recognition based on bidirectional LSTM network[J]. Computer Engineering, 2020, 46 (1): 60- 66
doi: 10.19678/j.issn.1000-3428.0053702
|
|
|
[8] |
SAHA R, MONDAL A, JAWAHAR C V. Graphical object detection in document images [C]//International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 51-58.
|
|
|
[9] |
GIRSHICK R. Fast r-cnn [C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1440-1448.
|
|
|
[10] |
HE K, GKIOXARI G, DOLLÁR P, et al. Mask r-cnn [C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
|
|
|
[11] |
RIBA P, DUTTA A, GOLDMANN L, et al. Table detection in invoice documents by graph neural networks [C]//2019 International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 122-127.
|
|
|
[12] |
XU Y, LI M, CUI L, et al. Layoutlm: pre-training of text and layout for document image understanding [C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego: ACM, 2020: 1192-1200.
|
|
|
[13] |
LI J, XU Y, LV T, et al. Dit: self-supervised pre-training for document image transformer[C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisson: ACM, 2022: 3530-3539.
|
|
|
[14] |
APPALARAJU S, JASANI B, KOTA B U, et al. Docformer: end-to-end transformer for document understanding [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 993-1003.
|
|
|
[15] |
PHAM V, PHAM C, DANG T. Road damage detection and classification with detectron2 and faster r-cnn [C]//2020 IEEE International Conference on Big Data. Atlanta: IEEE, 2020: 5592-5601.
|
|
|
[16] |
CAI Z, VASCONCELOS N. Cascade r-cnn: delving into high quality object detection [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6154-6162.
|
|
|
[17] |
RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 12179-12188.
|
|
|
[18] |
KIM W, SON B, KIM I. Vilt: vision-and-language transformer without convolution or region supervision [C]// Proceedings of the 38th International Conference on Machine Learning. [S. l.]: PMLR, 2021: 5583-5594.
|
|
|
[19] |
GHIASI G, LIN T Y, LE Q V. Nas-FPN: learning scalable feature pyramid architecture for object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE , 2019: 7036-7045.
|
|
|
[20] |
KAWINTIRANON K, SINGH L. Knowledge enhanced masked language model for stance detection [C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Mexico: ACL, 2021: 4725-4735.
|
|
|
[21] |
XIE Z, ZHANG Z, CAO Y, et al. Simmim: a simple framework for masked image modeling [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 9653-9663.
|
|
|
[22] |
HUANG Y, LV T, CUI L, et al. Layoutlmv3: pre-training for document ai with unified text and image masking [C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisson: ACM, 2022: 4083-4091.
|
|
|
[23] |
BAO H, DONG L, PIAO S, et al. Beit: Bert pre-training of image transformers [EB/OL]. [2022-09-03]. https://arxiv.org/abs/2106.08254.
|
|
|
[24] |
ZHONG X, TANG J, YEPES A J. Publaynet: largest dataset ever for document layout analysis [C]//2019 International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 1015-1022.
|
|
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|