Abstract:
The growing reliance on relational databases across industries and the ability to efficiently query and extract from a structured database has become a crucial skill in the industry. However, the Complexity of SQL Syntax creates a barrier for non-technical uses limiting their ability to interact with databases effectively. Natural Language to SQL (NL-to-SQL) query generation performs a critical task in bridging gap between non-technical users and relational databases and enables intuitive data interaction with out any need for SQL expertise. This thesis first explores various Text-to-SQL approaches, leveraging both proprietary model like Open AI’s GPT-4 and open-source models like RESDSQL, focusing on their performance across benchmark datasets like Spider, CoSQL and SPARC. Additionally, two datasets, MORD and CMEC are prepared from the real world use cases to highlight unique challenges such as hierarchical data structures, string matching operations, and privacy issues. The MORD dataset was queried using GPT-4 integrated with LangChain, to showcase natural language interaction with data and the usability of proprietary models without any tuning to domain specific dataset. Meanwhile the CMEC dataset is a privately curated dataset and access to it needs to be confidential. So we use open source models like RESDSQL that run on local server in order to minimize leakage. The dataset is pre-processed into a relational schema, and RESDSQL is fine tuned on curated NL to SQL pairs to improve performance. String matching techniques are applied to prepare better prompts in order to further enhance the results generated by the model.