1. School of Information Technology, Jiangxi University of Finance and Economics, Nanchang 330013, China 2. School of Software, Jiangxi Agricultural University, Nanchang 330013, China
The problem of generating structured query language (SQL) from natural language questions (Text-to-SQL) was analyzed. A two-stage framework was proposed to decouple the processes of schema linking and SQL generation in order to reduce the complexity of SQL generation. Database tables, columns, and values mentioned in the question were identified by a schema linker based on relational graph attention network in the first stage. The syntactic structure of the question and the internal relationships between database schema items were used to guide the model in learning the alignment between the question and the database. The original syntactic dependency tree was modified by merging relationships irrelevant to schema linking and adding dependencies between subordinating conjunctions in parallel structures and other elements in the sentence in view of the characteristics of Text-to-SQL task when constructing the question graph, which helps the model capture long-distance dependencies. SQL generation was performed by injecting the alignment information into the T5 encoder and fine-tuning it in the second stage. Experiments were conducted on the Spider, Spider-DK and Spider-Syn datasets. Results showed that the method performed well, especially for Text-to-SQL problems of medium difficulty and above.
Fig.1Two-stage SQL generation framework based on dependency graph attention network
Fig.2Structure of schema linker
关系
解释
dobj
间接宾语
nmod
名词修饰
compound
复合词
conj
连词
acl:relcl
关系从句修饰
amod
形容词修饰
nummod
量词修饰
Tab.1Collection of reserved dependencies
Fig.3Example of dependency modification
yi
yj
关系
解释
列
列
column-column-identity
i = j
column-column-sametable
yi和yj同属一个表
column-column-fk
yi是yj的外键
column-column-fkr
yj是$ {y}_{i} $的外键
表
表
table-table-identity
i = j
table-table-fkr
yi在yj中有一个外键列
table-table-fkb
yj在yi中有一个外键列
table-table-fk
yi和yj中均有对方的外键
表
列
table-column-pk
yj是yi的主键
table-column-has
yj属于yi,但不是主键
列
表
column-table-pk
yi是yj的主键
column-table-has
yi属于yj,但不是主键
Tab.2Relation types of schema items
模型
列
表
值
P
R
F1
P
R
F1
P
R
F1
SLSQL
0.826
0.820
0.823
0.806
0.840
0.822
0.773
0.741
0.757
本文方法
0.878
0.853
0.865
0.838
0.855
0.846
0.892
0.883
0.888
Tab.3Results of schema linking
问题
Ground Truth
本文方法
SLSQL
1. What is the average $\underline{{\rm{rank}}}$ for winners in all matches?
matches.winner_rank - column
matches.winner_rank - column
rankings.ranking- column
2. How many languages are spoken in $\underline{{\rm{Aruba}}}$?
country.Name - value
country.Name - value
countrylanguage.Language - value
3. What are flight numbers of flights arriving at Airport "$\underline{{\rm{APG}}}$"
flights.DestAirport - value
flights.DestAirport - value
airports.AirportName - value
Tab.4Comparison of model prediction results for some samples
模型
列
表
值
P
R
F1
P
R
F1
P
R
F1
默认模型
0.878
0.853
0.865
0.838
0.855
0.846
0.892
0.883
0.888
-依存增强
0.857
0.828
0.842
0.849
0.858
0.854
0.894
0.880
0.887
-问题RGAT
0.842
0.825
0.833
0.850
0.836
0.843
0.890
0.880
0.885
-模式RGAT
0.861
0.824
0.842
0.850
0.830
0.840
0.889
0.874
0.881
Tab.5Ablation study of schema linking
问题
Ground Truth
默认模型
-问题RGAT
What are the names and $\underline{{\rm{ranks}}}$ of the three youngest winners across all matches?
matches.winner_rank
matches.winner_rank
rankings.ranking
What is the total $\underline{{\rm{population}}}$ of Gelderland district?
city.Population
city.Population
country.Population
Return the first name, last name and email of the owners living in a state whose $\underline{{\rm{name}}}$ contains the substring 'North'.
Owners.state
Owners.state
Owners.email_address
Tab.6Comparison of column prediction
模型
列
表
值
P
R
F1
P
R
F1
P
R
F1
默认模型
0.878
0.853
0.865
0.838
0.855
0.846
0.892
0.883
0.888
列+值
0.860
0.842
0.851
0.852
0.840
0.846
0.865
0.857
0.861
问题词+值
0.866
0.847
0.856
0.852
0.879
0.866
0.799
0.788
0.794
问题词+列序号
0.848
0.841
0.844
0.858
0.867
0.863
0.832
0.815
0.823
Tab.7Comparison of input sequence
模型
EM/%
EX/%
T5-Base[19]
57.1
—
T5-3B[19]
70.0
—
Unifiedskg(T5-Base)[22]
58.1
60.1
Unifiedskg(T5-Large)[22]
66.6
68.3
Unifiedskg(T5-3B)[22]
71.8
74.4
T5-Base+PICARD[20]
68.5
68.4
T5-Large+PICARD[20]
69.1
72.9
T5-3B+PICARD[20]
75.5
79.3
TKK(T5-Base)[21]
61.5
64.2
TKK(T5-Large)[21]
70.6
73.2
RASAT(T5-Base)[17]
60.4
61.3
RASAT(T5-Large)[17]
66.7
69.2
RASAT(T5-3B) [17]
72.6
76.6
RASAT(T5-3B) +PICARD [17]
75.3
80.5
本文方法-T5-Base
68.5
71.6
本文方法-T5-Base+GT
76.3
75.2
本文方法-T5-Large
69.9
75.4
本文方法-T5-Large+GT
79.2
80.9
Tab.8Comparison of SQL generation results
模型
EX/%
easy
medium
hard
extra hard
TKK(T5-Large)[21]
89.5
76.5
52.9
45.2
T5-3B+PICARD[20]
95.2
85.4
67.2
50.6
RASAT(T5-3B)+PICARD[17]
96.0
86.5
67.8
53.6
本文方法-T5-Large
89.1
81.6
62.6
51.8
本文方法-T5-Large+GT
92.7
87.2
71.3
56.6
Tab.9Comparison of EX accuracy on different difficulty levels
模型
Spider-DK
Spider-Syn
EM/%
EX/%
EM/%
EX/%
RAT-SQL+BERT[11]
40.9
—
48.2
—
RAT-SQL+GRAPPA[29]
38.5
—
49.1
—
T5-Base[21]
—
—
40.8
43.8
T5-Large[21]
—
—
53.1
57.4
TKK(T5-Base)[21]
—
—
44.2
47.7
TKK(T5-Large)[21]
—
—
55.1
60.5
本文方法-T5-Base
39.3
46.6
49.8
54.6
本文方法-T5-Large
48.2
55.3
56.0
60.9
Tab.10SQL generation results on Spider-DK and Spider-Syn
[1]
刘喜平, 舒晴, 何佳壕, 等 基于自然语言的数据库查询生成研究综述[J]. 软件学报, 2022, 33 (11): 4107- 4136 LIU Xiping, SHU Qing, HE Jiahao, et al Survey on generating database queries based on natural language[J]. Journal of Software, 2022, 33 (11): 4107- 4136
[2]
GUO J, ZHAN Z, GAO Y, et al. Towards complex text-to-SQL in cross-domain database with intermediate representation [C]// Proceedings of ACL . Florence: ACL, 2019: 4524-4535.
[3]
WANG K, SHEN W, YANG Y, et al. Relational graph attention network for aspect-based sentiment analysis [C]// Proceedings of ACL . [S. l. ]: ACL, 2020: 3229-3238.
[4]
XU X, LIU C, SONG D X. SQLNet: generating structured queries from natural language without reinforcement learning [EB/OL]. (2017-11-13) [2023-10-27]. https://arxiv.org/abs/1711.04436.
[5]
YU T, LI Z, ZHANG Z, et al. TypeSQL: knowledge-based type-aware neural text-to-SQL generation [C]// Proceedings of NAACL-HLT . New Orleans: ACL, 2018: 588-594.
[6]
LEI W, WANG W, MA Z, et al. Re-examining the role of schema linking in text-to-SQL [C]// Proceedings of EMNLP . [S. l.]: ACL, 2020: 6943-6954.
[7]
CHOI D, SHIN M C, KIM E, et al RYANSQL: recursively applying sketch-based slot fillings for complex text-to-SQL in cross-domain databases[J]. Computational Linguistics, 2021, 47 (2): 309- 332
[8]
BOGIN B, BERANT J, GARDNER M. Representing schema structure with graph neural networks for text-to-SQL parsing [C]// Proceedings of ACL . Florence: ACL, 2019: 4560-4565.
[9]
BOGIN B, GARDNER M, BERANT J. Global reasoning over database structures for text-to-SQL parsing [C]// Proceedings of EMNLP-IJCNLP 2019 . Hong Kong: ACL, 2019: 3659-3664.
[10]
ZHONG V, XIONG C, SOCHER R. Seq2SQL: generating structured queries from natural language using reinforcement learning [EB/OL]. (2017-11-09) [2023-10-27]. https://arxiv.org/abs/1709.00103.
[11]
WANG B, SHIN R, LIU X, et al. RAT-SQL: relation-aware schema encoding and linking for text-to-SQL parsers [C]// Proceedings of ACL . [S. l. ]: ACL, 2020: 7567-7578.
[12]
VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Advances in Neural Information Processing Systems . Long Beach: MIT Press, 2017: 5998–6008.
[13]
LIN X V, SOCHER R, XIONG C. Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing [C]// Findings of the ACL: EMNLP 2020 . [S. l.]: ACL, 2020: 4870-4888.
[14]
CAO R, CHEN L, CHEN Z, et al. Line graph enhanced text-to-SQL model with mixed local and non-local relations [C]// Proceedings of ACL-IJCNLP 2021 . [S. l.]: ACL, 2021: 2541-2555.
[15]
CAI R, YUAN J, XU B, et al Sadga: Structure-aware dual graph aggregation network for text-to-sql[J]. Advances in Neural Information Processing Systems, 2021, 34: 7664- 7676
[16]
HUI B, GENG R, WANG L, et al. S2SQL: injecting syntax to question-schema interaction graph encoder for Text-to-SQL parsers [C]// Findings of the ACL: ACL 2022 . Dublin: ACL, 2022: 1254-1262.
[17]
QI J, TANG J, HE Z, et al. Rasat: Integrating relational structures into pretrained seq2seq model for text-to-sql [C]// Proceedings of the 2022 Conference on EMNLP . Abu Dhabi: ACL, 2022: 3215-3229.
[18]
RAFFEL C, SHAZEER N, ROBERTS A, et al Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21 (1): 5485- 5551
[19]
SHAW P, CHANG M W, PASUPAT P, et al. Compositional generalization and natural language variation: can a semantic parsing approach handle both? [C]// Proceedings of ACL-IJCNLP 2021 . [S. l. ]: ACL, 2021: 922–938.
[20]
SCHOLAK T, SCHUCHER N, BAHDANAU D. PICARD: parsing incrementally for constrained auto-regressive decoding from language models [C]// Proceedings of the 2021 Conference on EMNLP . Punta Cana: ACL, 2021: 9895-9901.
[21]
GAO C, LI B, ZHANG W, et al. Towards generalizable and robust Text-to-SQL parsing [C]// Findings of the ACL: EMNLP 2022 . Abu Dhabi: ACL, 2022: 2113-2125.
[22]
XIE T, WU C H, SHI P, et al. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models [C]// Proceedings of the 2022 Conference on EMNLP . Abu Dhabi: ACL, 2022: 602-631.
[23]
LI J, HUI B, CHENG R, et al. Graphix-T5: mixing pre-trained Transformers with graph-aware layers for text-to-SQL parsing [EB/OL]. (2023-01-18) [2023-10-27]. https://arxiv.org/abs/2301.07507.
[24]
DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of NAACL-HLT 2019 . Minneapolis: ACL, 2019: 4171–4186.
[25]
YU T, ZHANG R, YANG K, et al. Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task [C]// Proceedings of the 2018 Conference on EMNLP . Brussels: ACL, 2018: 3911–3921.
[26]
GAN Y, CHEN X, PURVER M. Exploring underexplored limitations of cross-domain Text-to-SQL generalization [C]// Proceedings of EMNLP . [S. l. ]: ACL, 2021: 8926–8931.
[27]
GAN Y, CHEN X, HUANG Q, et al. Towards robustness of Text-to-SQL models against synonym substitution [C]// Proceedings of ACL-IJCNLP. [S. l. ]: ACL , 2021: 2505-2515.
[28]
SHAZEER N, STERN M. Adafactor: adaptive learning rates with sublinear memory cost [C]// Proceedings of the 35th International Conference on Machine Learning , Stockholm: PMLR, 2018: 4596-4604.