Multimodal cascaded document layout analysis network based on Transformer

doi:10.3785/j.issn.1008-973X.2024.02.010

Journal of ZheJiang University (Engineering Science)

2024, Vol. 58, Issue 2: 317-324


Shaojie WEN1,2, Ruigang WU1,2, Chaowen FENG1,2, Yingli LIU1,2,*

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China


Abstract

A Transformer-based multimodal cascaded document layout analysis network (MCOD-Net) was proposed to address two problems in existing methods: the embeddings of the pretraining objectives for the text and image modalities are misaligned, and document images are preprocessed with convolutional neural network (CNN) structures, which complicates the pipeline and inflates the number of model parameters. A word-block alignment embedding module (WAEM) was designed to align the embeddings of the pretraining objectives for the text and image modalities. Masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA) were used for pretraining to strengthen the model's representation learning across the two modalities. The original document images were used directly and represented by linearly projected features of image patches, which simplified the model structure and reduced the number of model parameters. Experimental results show that the proposed model achieves a mean average precision (mAP) of 95.1% on the public PubLayNet dataset. Compared with other models, overall performance improved by 2.5%, with strong generalization ability and the best comprehensive performance.
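The idea of representing a document image by linear projections of image blocks (patches), rather than by CNN features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single-channel toy image, and the patch size 16 and embedding width 768 used in the parameter count are illustrative assumptions that do not come from the abstract.

```python
from typing import List

Matrix = List[List[float]]


def split_into_patches(image: Matrix, patch: int) -> Matrix:
    """Split an H x W single-channel image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def linear_project(patches: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Map each flattened patch to a D-dimensional embedding.

    weight holds D rows, each with one coefficient per patch pixel.
    """
    return [[sum(x * w for x, w in zip(p, row)) + b
             for row, b in zip(weight, bias)]
            for p in patches]


def projection_param_count(patch: int, channels: int, dim: int) -> int:
    """Weights plus biases of a single linear patch projection."""
    return patch * patch * channels * dim + dim


# Toy 4 x 4 image with pixel values 0..15, cut into 2 x 2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)

# Hypothetical projection to D = 2: the sum and the mean of each patch.
weight = [[1.0] * 4, [0.25] * 4]
bias = [0.0, 0.0]
embeddings = linear_project(patches, weight, bias)

# Assumed ViT-style setting (16 x 16 RGB patches, width 768), not from the paper.
params = projection_param_count(patch=16, channels=3, dim=768)
```

Under these assumed settings the whole image front end is a single projection with 590,592 parameters, which gives a sense of why dropping a CNN backbone can reduce model size.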

Key words: document layout analysis, word-block alignment embedding, Transformer, MCOD-Net model

Received: 26 May 2023 Published: 23 January 2024

CLC: TP 391

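Fund: National Natural Science Foundation of China (52061020, 61971208); Open Fund of the Yunnan Key Laboratory of Computer Technologies Application (2020103); Yunnan Major Science and Technology Special Project (202302AG050009)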

Corresponding Author: Yingli LIU. E-mail: wenshaojie@stu.kust.edu.cn; lyl@kust.edu.cn


Cite this article:

Shaojie WEN,Ruigang WU,Chaowen FENG,Yingli LIU. Multimodal cascaded document layout analysis network based on Transformer. Journal of ZheJiang University (Engineering Science), 2024, 58(2): 317-324.

URL:

https://www.zjujournals.com/eng/10.3785/j.issn.1008-973X.2024.02.010 OR https://www.zjujournals.com/eng/Y2024/V58/I2/317


Fig.1 Architecture diagram of MCOD-Net model

Tab.1 Comparison of parameter sizes for different image processing methods

Fig.2 PubLayNet dataset and corresponding labels

Tab.2 Overall performance of proposed model and existing models on PubLayNet dataset

Tab.3 mAP values of proposed model and existing models for identification of various elements in PubLayNet dataset

Fig.3 Recognition results of various elements

Fig.4 Visualization of MCOD-Net model on PubLayNet dataset

Fig.5 Visualization of MCOD-Net model on aluminum-silicon alloy literature


