
From Natural Language to SQL: Approaches and Challenges in Text2SQL

Apr 10, 2025

Background

Business users often find it inconvenient to write SQL queries just to pull the relevant data from a database and derive the right insights. Sometimes an app comes to their rescue. However, if they want to tweak their queries, they are dependent on the engineering team to make those changes. How good would it be if they could write their queries in plain English and get their data back? In other words, NLP to SQL.

Something like -

“Show me the total sales for product Sofwave in the last 2 months”

“How many Medical Spas or Plastic Surgery locations are there in New York?”

This article is exactly about that. The prime objective is to take natural language queries, transform them into Snowflake SQL queries, and get the results back. And can we use LLMs to do this (with some amount of training)?

In this article, I have tried two approaches to achieve Text2SQL functionality, along with their merits and demerits. Wherever possible I have shared the issues encountered, how I overcame them, and what the learnings were.

So grab your favorite coffee, because we’re about to embark on a wild ride..

Introduction

At the outset, we all need to agree that Text2SQL is not an easy problem to solve for the following reasons -

a) Linguistic Complexity and Ambiguity — Natural language questions often contain complex linguistic constructs, such as nested clauses, coreferences, and ellipses, which make it challenging to map them accurately to the corresponding parts of SQL queries. Additionally, natural language is inherently ambiguous, with multiple possible interpretations of a given user question.

b) Schema Understanding and Representation — To generate accurate SQL queries, text-to-SQL systems must have a comprehensive understanding of the database schema, including table names, column names, and the relationships between tables. Representing and encoding schema information is challenging because database schemas are often complex and vary significantly across domains.

c) Complex SQL Operations — Some SQL queries involve rare or complex operations and syntax, such as nested sub-queries, outer joins, and window functions. These operations appear less frequently in training data and are harder for text-to-SQL systems to generate accurately.

Despite recent progress in large language models and AI in general, generating correct SQL reliably remains a challenge.

Basic flow of a Text2SQL problem

Key considerations for the Text2SQL problem

Prompt Definition — Providing clear, straightforward prompts with a few seed examples to the LLM is vital for getting proper SQL-compatible results (see the prompt sketch after these considerations).

Intent — Expressing the intent of the query via tables, columns, synonyms, and datatypes helps the LLM with vector embedding and similarity processing.
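As a hedged illustration of the kind of seed-example prompt described above (the table names, columns, and example pair below are hypothetical, not the ones from my database):

# Hypothetical few-shot Text2SQL prompt; schema and example are illustrative only
PROMPT_TEMPLATE = """You are a Snowflake SQL expert.
Using only the schema below, write one valid Snowflake SQL query for the question.

Schema:
  SALES(SALE_ID, PRODUCT_NAME, SALE_DATE, AMOUNT, LOCATION_ID)
  LOCATIONS(LOCATION_ID, LOCATION_NAME, CATEGORY, STATE)

Example:
  Question: Show me the total sales for product Sofwave in the last 2 months
  SQL: SELECT SUM(AMOUNT) FROM SALES
       WHERE PRODUCT_NAME = 'Sofwave'
         AND SALE_DATE >= DATEADD(month, -2, CURRENT_DATE());

Question: {question}
SQL:"""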

I attempted the following two approaches to solve this problem. Let us understand each approach in detail.

Approach 1: DbSchema-based

In this approach, we create a dbSchema file of the existing database. It looks something like this-

Fig 1: dbSchema file format

You can generate this schema file using the following Python code

from langchain_community.utilities import SQLDatabase

# Snowflake connection details (credential variables are assumed to be defined earlier)
snowflake_uri = f"snowflake://{snowflake_user}:{snowflake_password}@{snowflake_account}/{snowflake_database}/{snowflake_schema}?warehouse={snowflake_warehouse}&role={snowflake_role}"

db = SQLDatabase.from_uri(snowflake_uri)
table_info = db.get_table_info()  # DDL-style description of all tables and columns
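A minimal sketch of how this table_info can then drive SQL generation with LangChain (the helper shown here is illustrative; whether the full pipeline on GitHub uses this exact helper is my assumption):

from langchain.chains import create_sql_query_chain
from langchain_openai import ChatOpenAI

# Illustrative only: a LangChain helper that injects db's table_info into its prompt
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
chain = create_sql_query_chain(llm, db)

sql = chain.invoke({"question": "How many Medical Spas or Plastic Surgery locations are there in New York?"})
print(sql)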

Complete code can be seen at this GitHub location (Reference 2). This approach posed several problems.

Problem 1: Model’s maximum context length exceeded

The database schema file was huge owing to the large database I was using. For context, I had about 193 tables and millions of records in total. I used the gpt-3.5-turbo LLM, whose context window is 16K tokens.

Solution: I followed a two-pronged strategy.

Firstly, I switched to a bigger model that supports a larger maximum token count and context window: GPT-4-Turbo, which supports a 128K context window. Secondly, I added another level of processing: a separate prompt asking the LLM which tables and columns are relevant to the query, thereby reducing the tokens needed for the final SQL-generation prompt. This effectively made the processing “Multi-Level-LLM” (a sketch follows Fig 2).

Fig 2: LLM models comparison
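A minimal sketch of that table-selection step, assuming the OpenAI chat completions API is called directly (the prompt wording is illustrative; question holds the user's natural language query and table_info is the schema loaded earlier):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Level 1: ask the LLM to shortlist the relevant tables/columns before generating SQL
selection_prompt = (
    "Given this database schema:\n"
    f"{table_info}\n\n"
    "List only the tables and columns needed to answer the question, as a comma-separated list.\n"
    f"Question: {question}"
)
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": selection_prompt}],
)
relevant_schema = response.choices[0].message.content

# Level 2: pass only relevant_schema (far fewer tokens) into the SQL-generation prompt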

Problem 2: Slow response time due to repeated dbSchema reads

The code was reading the dbSchema on every run, adding significant I/O overhead to each query. Not good.

Solution: Caching. I created a cache that gets populated only the first time or when it has expired (expiry set to 7 days). This change brought in huge performance gains, as shown below.

(Performance comparison: schema load time with vs. without the cache)
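A minimal sketch of such a cache, assuming the schema is pickled to local disk with a 7-day expiry (the file name and layout here are my illustrative choices, not the exact production implementation):

import os
import pickle
import time

CACHE_FILE = "dbschema_cache.pkl"
CACHE_TTL_SECONDS = 7 * 24 * 3600  # 7-day expiry

def get_table_info_cached(db):
    """Return the cached schema; refresh only on first run or after expiry."""
    if os.path.exists(CACHE_FILE) and time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL_SECONDS:
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    info = db.get_table_info()  # expensive call that hits Snowflake
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(info, f)
    return info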

Problem 3: Inconsistent naming of columns in the tables

The existing tables had inconsistent naming for columns, e.g. in some places the column was LOCATION_ID while in others it was LOCATIONID or LOCATION. Since a machine is reading the schema, these subtle differences add a layer of complexity for the LLM to decipher, especially in the absence of synonyms.

Solution: Normalize the column names in the tables to follow a consistent convention AND validate the generated query against the cached schema, auto-correcting where possible (a sketch follows). This worked for simple queries; however, as we tried complex joins involving SQL functions, the auto-correction fell apart.
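A rough sketch of the validation half of that solution (the regex-based identifier extraction here is deliberately simplistic and my own illustration, not the exact production logic):

import re

SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "JOIN", "ON", "GROUP", "BY",
                "ORDER", "AS", "SUM", "COUNT", "AVG", "LIMIT"}

def unknown_identifiers(generated_sql, cached_schema_columns):
    """Return identifiers in the generated SQL that are not in the cached schema."""
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", generated_sql.upper()))
    known = {c.upper() for c in cached_schema_columns}
    return [ident for ident in identifiers if ident not in known and ident not in SQL_KEYWORDS]

# Unknown identifiers such as LOCATIONID can then be auto-corrected by
# fuzzy-matching them against the cached column names (e.g. to LOCATION_ID)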

Approach 2: Semantic Model-based

What is a semantic model?

A semantic model is a structured representation of knowledge that captures the meaning of, and relationships between, concepts in a specific domain. It bridges the gap between human communication and machine interpretation. Semantic models are particularly important for GenAI tasks such as natural language processing, question answering, and content generation.

In the context of Snowflake Cortex, a semantic model provides information about the columns and tables you are pointing at. This is especially useful when the column names are not already in natural language. The semantic model file (a YAML file) describes metadata about the columns, including dimensions, measures, time dimensions, and any filters that end users may reference.

Fig 3: Sample Semantic Model YAML file

Semantic models are considered metadata and consist of the following key elements -

Metric is a quantifiable measure of business performance, typically calculated by aggregating facts across multiple rows, e.g. via the AVG() or SUM() SQL functions.

Dimension represents categorical data that provides context to facts, such as product, customer, or location information. Dimensions typically contain descriptive text values, such as product names or customer addresses.

Facts are measurable, quantitative data that provide the basis for analyses. Facts represent numeric values related to business processes, such as sales, cost, or quantity. A fact is an unaggregated, row-level concept.

Filter is a condition that limits query results to specific data subsets based on criteria such as time period, location, or category.

How to create a semantic model?

In Snowflake, it can be created easily by following the steps shown below.

Fig 4: Snowflake Semantic Model Creation

In the subsequent step, you will be asked to select the tables and columns (not shown for brevity) that you would like to include in your semantic model. The typical queries that users will ask should dictate the selection of tables and columns in this step.

Once the model is created and automatically stored in the Snowflake stage, it can be accessed in your Python code as follows

DATABASE = "<YOUR_DB>"
SCHEMA = "<YOUR_SCHEMA>"
STAGE = "<YOUR_STAGE_NAME>"
FILE = "<YOUR_SEMANTIC_MODEL_FILE>.yaml"
WAREHOUSE = "<YOUR_WH>"

# 'prompt' holds the user's natural-language question
request_body = {
    "messages": [{"role": "user", "content": [{"type": "text", "text": prompt}]}],
    "semantic_model_file": f"@{DATABASE}.{SCHEMA}.{STAGE}/{FILE}",
}
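For completeness, a hedged sketch of posting this request body to the Cortex Analyst REST endpoint (/api/v2/cortex/analyst/message, per the Snowflake documentation); the host, token, and exact auth header depend on your authentication setup and are assumptions here:

import requests

HOST = "<YOUR_ACCOUNT>.snowflakecomputing.com"  # assumption: your account URL
AUTH_TOKEN = "<YOUR_TOKEN>"                     # assumption: a valid Snowflake auth token

resp = requests.post(
    f"https://{HOST}/api/v2/cortex/analyst/message",
    json=request_body,
    headers={"Authorization": f"Bearer {AUTH_TOKEN}", "Content-Type": "application/json"},
)
resp.raise_for_status()

# The response content typically includes a text interpretation plus a "sql" block
for block in resp.json()["message"]["content"]:
    if block["type"] == "sql":
        print(block["statement"])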

Complete code can be seen at this GitHub location (Reference 2). This approach worked in the majority of cases. The first query takes some time to respond; however, subsequent queries are quite quick. It also did not require any caching. However, even this approach did not return results when the query involved multiple joins (see Fig 6).

Fig 5: Chatty App Screenshot (Query working example)
Fig 6: Chatty App Screenshot (Query not working example)

To handle queries involving multiple joins, I created a simple view over those tables, then created a new semantic model on top of this view, and the same query worked (see Fig 7 and the sketch after it).

Fig 7: Chatty App Screenshot with working query
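A hedged sketch of that workaround using the Snowflake Python connector (the view name, joined tables, and join keys below are hypothetical; the new semantic model is then pointed at this view):

import snowflake.connector

conn = snowflake.connector.connect(
    user=snowflake_user, password=snowflake_password, account=snowflake_account,
    warehouse=snowflake_warehouse, database=snowflake_database, schema=snowflake_schema,
)

# Pre-join the tables that the multi-join question needs
conn.cursor().execute("""
    CREATE OR REPLACE VIEW SALES_BY_LOCATION_V AS
    SELECT s.SALE_ID, s.PRODUCT_NAME, s.AMOUNT, l.LOCATION_NAME, l.STATE
    FROM SALES s
    JOIN LOCATIONS l ON l.LOCATION_ID = s.LOCATION_ID
""")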

Limitations of semantic model in Snowflake

  • Cortex Analyst imposes a 1 MB size limit on the semantic model file to restrict the size of API inputs
  • Cortex Analyst performs in-memory retrieval of sample values and verified queries added to the semantic model YAML. After removing all sample values and verified queries, the semantic model cannot exceed 32K tokens (roughly 4 × 32K ≈ 128K characters, i.e. approximately 128 KB)

References:

  1. https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-analyst/semantic-model-spec
  2. Code location — https://github.com/toashishagarwal/Snowflake-AI
  3. https://arxiv.org/pdf/2406.08426

If you liked my article, consider giving it a few claps and follow me on LinkedIn.

Disclaimer: No part of this article was written with the help of ChatGPT or any other LLM (except the Ghibli-style image at the start). All the content is self-written.

Kindly give credit if you refer to it anywhere.

Written by Ashish Agarwal
Engineer and Water Color Artist @toashishagarwal