计算机技术、通信技术 |
|
|
|
|
基于Transformer的多模态级联文档布局分析网络 |
温绍杰1,2( ),吴瑞刚1,2,冯超文1,2,刘英莉1,2,*( ) |
1. 昆明理工大学 信息工程与自动化学院,云南 昆明 650500 2. 昆明理工大学 云南省计算机技术应用重点实验室,云南 昆明 650500 |
|
Multimodal cascaded document layout analysis network based on Transformer |
Shaojie WEN1,2( ),Ruigang WU1,2,Chaowen FENG1,2,Yingli LIU1,2,*( ) |
1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China 2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China |
引用本文:
温绍杰,吴瑞刚,冯超文,刘英莉. 基于Transformer的多模态级联文档布局分析网络[J]. 浙江大学学报(工学版), 2024, 58(2): 317-324.
Shaojie WEN,Ruigang WU,Chaowen FENG,Yingli LIU. Multimodal cascaded document layout analysis network based on Transformer. Journal of ZheJiang University (Engineering Science), 2024, 58(2): 317-324.
链接本文:
https://www.zjujournals.com/eng/CN/10.3785/j.issn.1008-973X.2024.02.010
或
https://www.zjujournals.com/eng/CN/Y2024/V58/I2/317
|
1 |
SOTO C, YOO S. Visual detection with context for document layout analysis [C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Hong Kong: ACL, 2019: 3464-3470.
|
2 |
WATANABE T, LUO Q, SUGIE N Structure recognition methods for various types of documents[J]. Machine Vision and Applications, 1993, 6 (2/3): 163- 176
|
3 |
HIRAYAMA Y. A method for table structure analysis using DP matching[C]//Proceedings of 3rd International Conference on Document Analysis and Recognition. Montreal: IEEE, 1995: 583-586.
|
4 |
FANG J, GAO L, BAI K, et al. A table detection method for multipage pdf documents via visual seperators and tabular structures [C]//2011 International Conference on Document Analysis and Recognition. Beijing: IEEE, 2011: 779-783.
|
5 |
BUNKE H, RIESEN K Recent advances in graph-based pattern recognition with applications in document analysis[J]. Pattern Recognition, 2011, 44 (5): 1057- 1067
doi: 10.1016/j.patcog.2010.11.015
|
6 |
HINTON G E, SALAKHUTDINOV R R Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313 (5786): 504- 507
doi: 10.1126/science.1127647
|
7 |
张真, 李宁, 田英爱 基于双向LSTM网络的流式文档结构识别[J]. 计算机工程, 2020, 46 (1): 60- 66 ZHANG Zhen, LI Ning, TIAN Yingai Stream document structure recognition based on bidirectional LSTM network[J]. Computer Engineering, 2020, 46 (1): 60- 66
doi: 10.19678/j.issn.1000-3428.0053702
|
8 |
SAHA R, MONDAL A, JAWAHAR C V. Graphical object detection in document images [C]//International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 51-58.
|
9 |
GIRSHICK R. Fast r-cnn [C]//Proceedings of the IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1440-1448.
|
10 |
HE K, GKIOXARI G, DOLLÁR P, et al. Mask r-cnn [C]//Proceedings of the IEEE International Conference on Computer Vision. Venice: IEEE, 2017: 2961-2969.
|
11 |
RIBA P, DUTTA A, GOLDMANN L, et al. Table detection in invoice documents by graph neural networks [C]//2019 International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 122-127.
|
12 |
XU Y, LI M, CUI L, et al. Layoutlm: pre-training of text and layout for document image understanding [C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego: ACM, 2020: 1192-1200.
|
13 |
LI J, XU Y, LV T, et al. Dit: self-supervised pre-training for document image transformer[C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisson: ACM, 2022: 3530-3539.
|
14 |
APPALARAJU S, JASANI B, KOTA B U, et al. Docformer: end-to-end transformer for document understanding [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 993-1003.
|
15 |
PHAM V, PHAM C, DANG T. Road damage detection and classification with detectron2 and faster r-cnn [C]//2020 IEEE International Conference on Big Data. Atlanta: IEEE, 2020: 5592-5601.
|
16 |
CAI Z, VASCONCELOS N. Cascade r-cnn: delving into high quality object detection [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City: IEEE, 2018: 6154-6162.
|
17 |
RANFTL R, BOCHKOVSKIY A, KOLTUN V. Vision transformers for dense prediction [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Montreal: IEEE, 2021: 12179-12188.
|
18 |
KIM W, SON B, KIM I. Vilt: vision-and-language transformer without convolution or region supervision [C]// Proceedings of the 38th International Conference on Machine Learning. [S. l.]: PMLR, 2021: 5583-5594.
|
19 |
GHIASI G, LIN T Y, LE Q V. Nas-FPN: learning scalable feature pyramid architecture for object detection [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Long Beach: IEEE , 2019: 7036-7045.
|
20 |
KAWINTIRANON K, SINGH L. Knowledge enhanced masked language model for stance detection [C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Mexico: ACL, 2021: 4725-4735.
|
21 |
XIE Z, ZHANG Z, CAO Y, et al. Simmim: a simple framework for masked image modeling [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. New Orleans: IEEE, 2022: 9653-9663.
|
22 |
HUANG Y, LV T, CUI L, et al. Layoutlmv3: pre-training for document ai with unified text and image masking [C]//Proceedings of the 30th ACM International Conference on Multimedia. Lisson: ACM, 2022: 4083-4091.
|
23 |
BAO H, DONG L, PIAO S, et al. Beit: Bert pre-training of image transformers [EB/OL]. [2022-09-03]. https://arxiv.org/abs/2106.08254.
|
24 |
ZHONG X, TANG J, YEPES A J. Publaynet: largest dataset ever for document layout analysis [C]//2019 International Conference on Document Analysis and Recognition. Sydney: IEEE, 2019: 1015-1022.
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|