Introducing the world's largest synthetic open-source Text-to-SQL dataset

Gretel releases largest open source Text-to-SQL dataset to accelerate AI model training

Elevating data quality in the age of AI

One of the most notable recent shifts in the world of Large Language Models (LLMs), and AI more generally, is the renewed focus on data quality. From cutting-edge models like Falcon, Phi-1.5, Phi-2, Aya, Gemini, and beyond, massive model performance gains are reaffirming the data-centric approach to AI, eloquently described by Andrew Ng as "the discipline of systematically engineering the data needed to build a successful AI system."

"... training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme … Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others." – Phi-2: The surprising power of small language models
"Several works apply synthetic data generation to promote reasoning, code generation, and algorithmic skills. Here, we … expand upon these initial findings and explore the utility of synthetic data generation combined with translation." – Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
"We find that data quality is an important factor for highly-performing models, and believe that many interesting questions remain around finding the optimal dataset distribution for pre-training… [our] sources include vendor-created data, third-party licensed sources, and synthetic approaches." – Gemini: A Family of Highly Capable Multimodal Models

While the effect of poor data quality has been known to data and ML practitioners for a long time (“garbage in, garbage out”), leveraging high-quality data in ML applications has been anywhere from challenging to downright impossible. In fact, it has become standard practice for data teams to dedicate as much as 80% of their time to combing through, cleaning and improving data. And the problem is only getting worse:

"… critical challenges … persist in data management, with poor data quality emerging as a predominant issue for 57% of professionals, an increase from 41% in 2022." – dbt Labs 2024 State of Analytics Engineering

At Gretel, we believe there is a better way.

Pioneering high-quality synthetic data

At Gretel, we are addressing data quality issues head-on by providing organizations with tools to generate high-quality synthetic data. We are accelerating the transition to data-centric AI by allowing teams to produce data either from scratch or as an augmented version of existing data, all while preserving privacy and security.

Today, as part of this effort, we are excited to announce the release of a purely synthetic Text-to-SQL dataset. Available on Hugging Face, gretelai/synthetic_text_to_sql is a rich dataset of high-quality synthetic Text-to-SQL samples, designed and generated using Gretel Navigator, and released under Apache 2.0. We are thrilled to have the community leverage this synthetic data in a truly open-source fashion.

Dataset overview

As of April 2024, our gretelai/synthetic_text_to_sql dataset stands as the largest and most diverse synthetic Text-to-SQL dataset available to date.

The dataset includes:

  • 105,851 records partitioned into 100,000 train and 5,851 test records
  • ~23M total tokens, including ~12M SQL tokens
  • Coverage across 100 distinct domains/verticals
  • Comprehensive array of SQL tasks: data definition, retrieval, manipulation, analytics & reporting
  • Wide range of SQL complexity levels, including subqueries, single joins, multiple joins, aggregations, window functions, set operations
  • Database context, including table and view create statements
  • Natural language explanations of what the SQL query is doing
  • Contextual tags to optimize model training

Data schema, preview, example & word clouds

The dataset includes the 11 fields shown below:

Field                       Type     Description
id                          integer  Data record id
domain                      string   Domain/vertical (e.g., aerospace)
domain_description          string   Domain/vertical data description
sql_complexity              string   SQL code complexity label
sql_complexity_description  string   SQL code complexity description
sql_task_type               string   Type of SQL task being performed
sql_task_type_description   string   Description of the SQL task type
sql_prompt                  string   Natural language SQL prompt
sql_context                 string   Database context for the SQL prompt
sql                         string   SQL corresponding to the SQL prompt
sql_explanation             string   Explanation of the SQL query

Table 1. Data preview

Example:

{
  "id": 39325,
  "domain": "public health",
  "domain_description": "Community health statistics, infectious disease tracking data, healthcare access metrics, and public health policy analysis.",
  "sql_complexity": "aggregation",
  "sql_complexity_description": "aggregation functions (COUNT, SUM, AVG, MIN, MAX, etc.), and HAVING clause",
  "sql_task_type": "analytics and reporting",
  "sql_task_type_description": "generating reports, dashboards, and analytical insights",
  "sql_prompt": "What is the total number of hospital beds in each state?",
  "sql_context": "CREATE TABLE Beds (State VARCHAR(50), Beds INT); INSERT INTO Beds (State, Beds) VALUES ('California', 100000), ('Texas', 85000), ('New York', 70000);",
  "sql": "SELECT State, SUM(Beds) FROM Beds GROUP BY State;",
  "sql_explanation": "This query calculates the total number of hospital beds in each state in the Beds table. It does this by using the SUM function on the Beds column and grouping the results by the State column."
}
Figure 1: Word clouds for the natural language prompt, database context, SQL, and SQL explanation
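To make the record above concrete, its sql_context and sql fields can be executed directly. Below is a minimal illustrative sketch using Python's built-in sqlite3 module — any SQL engine would do; this is our own illustration, not part of the dataset pipeline:

```python
import sqlite3

# Database context and query taken from the example record above
sql_context = (
    "CREATE TABLE Beds (State VARCHAR(50), Beds INT); "
    "INSERT INTO Beds (State, Beds) VALUES "
    "('California', 100000), ('Texas', 85000), ('New York', 70000);"
)
sql = "SELECT State, SUM(Beds) FROM Beds GROUP BY State;"

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.executescript(sql_context)      # apply the database context
rows = conn.execute(sql).fetchall()  # run the generated query
print(sorted(rows))
# → [('California', 100000), ('New York', 70000), ('Texas', 85000)]
```

Because each record ships with its own database context, every sample is self-contained and can be verified end-to-end like this.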

Data statistics

Table 2: Breakdown of text to SQL dataset features and corresponding data types and token counts
Figure 2: Token distributions for sql_prompt, sql_context, sql, and sql_explanation
Figure 3: Breakdown of domains/verticals and corresponding records
Table 3: Breakdown of SQL complexity and corresponding records
Table 4: Breakdown of SQL task type and corresponding records

The significance of Text-to-SQL

Today, data is a critical asset helping companies and governments inform decision-making, understand customer behavior, drive strategy, and measure performance. Yet, extracting insights out of data is often challenging due to the need to know Structured Query Language (SQL), which is the standard language used to interact with data in databases, data warehouses and data lakes.

The idea behind Text-to-SQL is to enable users to interact with databases using natural language questions and commands, rather than having to write complex SQL statements manually. Imagine simply asking, "What are the names and prices of electronic products under $500, sorted from highest to lowest price?" instead of manually crafting the following SQL query:

SELECT product_name, price 
FROM products
WHERE category = 'Electronics' AND price < 500
ORDER BY price DESC;

Doing the above well and at scale democratizes access to data by allowing natural language interactions with databases, thereby removing the barrier to entry and opening up a whole new world of opportunities to businesses. 

Training or fine-tuning a large language model (LLM) on Text-to-SQL data can be useful for developing natural language interfaces to databases. In fact, there are already signs of a substantial demand for data like this:

"A significant and growing number of data teams are investing in AI. 57% of respondents indicate that they currently—or soon expect to—manage data for AI training… there's a strong expectation of future benefits it might help create—including for self-serve data exploration." – dbt Labs 2024 State of Analytics Engineering

Confronting the challenges

One of the most promising approaches to Text-to-SQL today is using LLMs trained for this specific task and combining such LLMs with more involved approaches like Retrieval Augmented Generation (RAG).

The challenge today is that access to high-quality Text-to-SQL data for training or fine-tuning LLMs is limited. Most Text-to-SQL datasets out there are manually curated and annotated, limiting their diversity, size, applicability, and utility. The process is labor intensive, cumbersome, expensive, and often quite mind-numbing for the people involved! For example, the ubiquitous Spider Text-to-SQL dataset, consisting of 7k samples, was annotated by 11 college students at Yale and took 1,000 hours to complete. That's more than two weeks of non-stop work for a crew of 11 people, and more than 4 months of work for a single individual: an incredible amount of effort for a relatively small dataset in the context of large language models.

A secondary challenge is licensing. While Spider is available under the commercially permissive Creative Commons license CC BY-SA 4.0, it is a copyleft license, meaning a derivative work must be licensed under the same or a compatible copyleft license. This is quite different from an open-source license like MIT or Apache 2.0, which allows derivative works to be released under different license terms, without ShareAlike requirements. In fact, the vast majority of datasets out there do not carry an MIT/Apache license (see this great list of 27 text-to-SQL datasets compiled by Numbers Station).

Gretel Navigator opens the door to generating high-quality, diverse synthetic data quickly and on demand. This allows teams to innovate faster, shorten the time to bring ML solutions to production, and substantially lower development costs.

Achieving unprecedented quality with Gretel Navigator

Generating a high-quality synthetic dataset from scratch and doing so at scale is no trivial exercise! Many considerations go into it, given how multifaceted the notion of quality is (it’s use-case specific and touches on things like formatting, accuracy, diversity, completeness, coherence, consistency, uniqueness, etc.). 

With Gretel Navigator, generating high-quality datasets becomes a reality. Navigator is a compound AI system that leverages many components behind the scenes to make generating synthetic data a breeze. From agentic workflow orchestration, task planning, and tool use to data post-processing and synthetic data best practices, Gretel Navigator brings domain expertise and years of in-market learning and development to bear on the problem of high-quality synthetic data generation.

To properly assess the quality of our Text-to-SQL data, we leveraged the LLM-as-a-judge technique. Recent LLM-as-a-judge research demonstrates that "strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans". This makes LLM-as-a-judge a scalable, explainable way to approximate subject matter expertise. More importantly, in this case, it lets us evaluate quality objectively, without relying on the benchmarks we developed internally.

For this particular task, we lean on the SQL subject matter expertise of an LLM to holistically evaluate SQL quality across 1,000 randomly chosen samples of data. We use GPT-4 to score samples from our Text-to-SQL dataset and compare the results to 1,000 randomly chosen samples from the b-mc2/sql-create-context dataset, an extension of the previously mentioned Spider dataset that includes database context, enabling an apples-to-apples comparison. Importantly, neither dataset was produced by GPT-4 or contains any GPT-4 output.
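To give a sense of how such a judge is set up, here is a hypothetical sketch of assembling a rubric prompt for the judge model. The five criteria come from the evaluation described here, but the prompt wording and the build_judge_prompt helper are illustrative assumptions, not Gretel's actual evaluation code:

```python
# Illustrative LLM-as-a-judge rubric prompt builder. The criteria mirror the
# five evaluation axes used in this post; everything else is a sketch.
CRITERIA = [
    "Adherence to Instructions",
    "SQL Correctness",
    "Readability and Maintainability",
    "Scalability",
    "Compliance with Standards",
]

def build_judge_prompt(sql_prompt: str, sql_context: str, sql: str) -> str:
    """Assemble a grading prompt asking the judge to score one record."""
    rubric = "\n".join(
        f"- {criterion}: score 1-5 with a one-sentence justification"
        for criterion in CRITERIA
    )
    return (
        "You are a SQL expert. Grade the SQL sample below on each criterion:\n"
        f"{rubric}\n\n"
        f"Question: {sql_prompt}\n"
        f"Schema: {sql_context}\n"
        f"SQL: {sql}\n"
    )
```

The resulting prompt would then be sent to a strong judge model (GPT-4 in the evaluation above), and the per-criterion scores averaged across the sampled records.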

We observe that our dataset consistently scores higher on:

  • Compliance with SQL Standards: +54.6%
  • SQL Correctness: +34.5%
  • Adherence to Instructions: +8.5%
Figure 4: LLM-as-a-judge comparison of gretelai/synthetic_text_to_sql with the b-mc2/sql-create-context dataset across five criteria: (i) Adherence to Instructions, (ii) SQL Correctness, (iii) Readability and Maintainability, (iv) Scalability, and (v) Compliance with Standards

The grading rubric with the explicit criteria used for the LLM-as-a-judge evaluation is provided on Hugging Face, along with a few examples of problematic samples from the b-mc2/sql-create-context dataset.

In addition to the above, the parsability and validity of the SQL in both the sql_context and sql fields have been verified using the Python SQL parser/transpiler sqlglot and the SQL format/syntax/semantics validator sqlvalidator.
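The actual checks used sqlglot and sqlvalidator; as a rough approximation of what such a parse-and-execute check looks like, here is a stdlib-only sketch that applies a record's context to an in-memory SQLite database and runs its query. The check_record helper is our assumption about the shape of the check, not Gretel's code:

```python
import sqlite3

def check_record(sql_context: str, sql: str) -> bool:
    """Return True if the context applies and the query parses and runs on an
    in-memory SQLite database (a rough executability check; the actual
    validation described above used sqlglot and sqlvalidator)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sql_context)  # create tables / insert rows
        conn.execute(sql).fetchall()     # query must parse and execute
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# A well-formed record passes; a typo in the query ("SELEC") fails.
print(check_record("CREATE TABLE Beds (State VARCHAR(50), Beds INT);",
                   "SELECT State, SUM(Beds) FROM Beds GROUP BY State;"))  # → True
print(check_record("CREATE TABLE t (x INT);", "SELEC x FROM t;"))         # → False
```

Note that SQLite's dialect differs from other engines, so a full-fidelity check would use a dialect-aware parser like sqlglot, as the post describes.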

Table 5: Breakdown of SQL parsability and validity for gretelai/synthetic_text_to_sql and b-mc2/sql-create-context

Conclusion

Our release of the largest synthetic text-to-SQL dataset on Hugging Face is not just a milestone in the world of synthetic data; it's an invitation. We are excited to contribute to the open-source community and encourage developers, researchers, and data enthusiasts to take the dataset for a spin, and build upon it.

This release is also merely a glimpse into the capabilities of Gretel. The real value of synthetic data lies in the ability to design and iterate on data to address specific data gaps, incorporate unique business logic, and infuse it with use-case-specific context. We invite you to explore Gretel tools and capabilities to accelerate your journey towards data-centric AI.

AI/ML is always a team sport. A special thank you to Marjan Emadi, Dhruv Nathawani, Lipika Ramaswamy, Kendrick Boyd, Maarten Van Segbroeck, Matthew Grossman, Piotr Mlocek, and Drew Newberry. 👏👏👏 We're just getting started. 🚀