How to Safely Query Enterprise Data with Langchain Agents + SQL + OpenAI + Gretel
Agent-based approaches coupled with large language models (LLMs) are quickly transforming how we interact with databases and data warehouses. Together, these technologies let you query the data behind your application or business in natural language, eliminating the need for SQL expertise and even enabling seamless queries across diverse systems.
In this post, we’ll walk through an example of how Langchain, LLMs (whether open-source models like Llama-2 and Falcon, or API-based models from OpenAI, Google, and Anthropic), and synthetic data from Gretel combine to create a powerful, privacy-preserving solution for natural language interaction with data in databases and warehouses. We'll introduce key concepts such as agents, LLM chains, and synthetic data, then delve into a practical code example to bring these ideas to life.
- LLM Chains: Compositions of calls to language models and other tools; frameworks such as Langchain make it easy to build applications by chaining these calls together.
- Agents: Agents use an LLM to decide what actions to take and the order to take them in, making future decisions by iteratively observing the outcome of prior actions.
- Function-aware LLMs: Certain newer LLMs (like OpenAI’s GPT-3.5-turbo-0613 and Google’s PaLM text-bison) have been fine-tuned to detect when a function should be called and respond with the inputs that should be passed to the function.
- Synthetic data: An artificial version of real-world data, created by data-aware generative models, that can offer strong privacy guarantees. Gretel offers generative models for working with tabular data based on Transformer, GAN, and graph-based architectures.
- SQL Databases: The backbone holding the data you'll be querying. For today, we’ll use a SQLite database.
What is an Agent in Langchain?
Some applications will require not just a predetermined chain of calls to LLMs/other tools, but potentially an unknown chain that depends on the user's input, too. In these types of chains, there is an “agent” that has access to a suite of tools — for example, math, or the ability to query a SQL database. Depending on the user input, the agent can then decide which, if any, of these tools to call.
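To make the decide-act-observe idea concrete, here is a toy sketch of an agent loop. This is not Langchain's implementation: the "decision" a real agent delegates to an LLM is faked here with a simple pattern check, and both tools are hypothetical stand-ins.

```python
import re

def calculator_tool(expression: str) -> str:
    # Evaluate a simple arithmetic expression (demo only; eval is unsafe on untrusted input).
    return str(eval(expression, {"__builtins__": {}}))

def sql_tool(query: str) -> str:
    # Stand-in for a real database lookup.
    return "42 rows"

TOOLS = {"calculator": calculator_tool, "sql": sql_tool}

def choose_tool(question: str) -> str:
    # A real agent asks the LLM which tool fits; we fake the decision with a pattern check.
    if re.fullmatch(r"[\d\s+\-*/().]+", question):
        return "calculator"
    return "sql"

def run_agent(question: str) -> str:
    tool_name = choose_tool(question)         # decide which tool to use
    observation = TOOLS[tool_name](question)  # act and observe the result
    return f"[{tool_name}] {observation}"     # answer based on the observation

print(run_agent("2 + 2"))  # → [calculator] 4
```

A real agent repeats this loop, feeding each observation back to the LLM until it decides it can answer.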
Under the hood, the Langchain SQL Agent uses a MRKL-based approach (MRKL is pronounced "miracle"): it inspects the database schema and a few example rows, uses them to generate SQL queries, and then executes those queries to pull back the results you asked for.
Generating synthetic tabular data
Before diving into the example, let's talk about synthetic data. With Gretel's models, you can create an artificial but statistically similar version of your sensitive data. This synthetic data is safe to use thanks to mathematically grounded privacy protections such as differential privacy. In our example, we'll use both real and synthetic data to show why this privacy matters when giving language models access to sensitive information.
To generate your own synthetic data for this example, grab the IBM HR Employee Attrition dataset (or your own) and an API key from https://console.gretel.ai. You can run Gretel's quickstart notebook or console-based workflow to create a synthetic version of the data.
For this example, I used the Gretel Tabular DP model (notebook, docs) with an epsilon value of 5 for strong privacy guarantees that are great for regulated environments. For maximum accuracy while still maintaining privacy, you can also try the Gretel ACTGAN model (docs), which excels at working with high-dimensional tabular data to enable machine learning and analytics use cases.
Getting started: Installation
First, install dependencies.
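The package names below reflect what this walkthrough uses at the time of writing; adjust versions for your environment:

```shell
pip install -U langchain openai pandas
```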
Initializing the Langchain Agent
Note: You'll need your OpenAI API key for this step. Keep it private; set it as an environment variable rather than hard-coding it.
Here's the code to initialize the Langchain Agent and connect it to your SQL database.
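A sketch of that initialization, using the Langchain package layout current when this post was written. The database path `attrition_real.db` is a hypothetical filename for the SQLite file holding the data; swap in your own.

```python
import os

# Sketch: initialize a Langchain SQL agent over a SQLite database.
# Assumes the langchain and openai packages are installed, OPENAI_API_KEY
# is set, and "attrition_real.db" (a hypothetical path) already exists.
DB_URI = "sqlite:///attrition_real.db"

def build_agent(db_uri: str = DB_URI):
    # Imports reflect the Langchain package layout at the time of writing.
    from langchain.agents import create_sql_agent
    from langchain.agents.agent_toolkits import SQLDatabaseToolkit
    from langchain.chat_models import ChatOpenAI
    from langchain.sql_database import SQLDatabase

    db = SQLDatabase.from_uri(db_uri)
    llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)
    toolkit = SQLDatabaseToolkit(db=db, llm=llm)
    # The toolkit gives the agent schema-inspection, query-checking,
    # and query-execution tools for its MRKL-style loop.
    return create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

if __name__ == "__main__" and os.environ.get("OPENAI_API_KEY"):
    agent = build_agent()
    print(agent.run("How many employees are in the dataset?"))
```

With `verbose=True`, the agent prints each intermediate step (the SQL it writes and the rows it observes), which is handy for sanity-checking its reasoning.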
Here, we also import some sample datasets. We'll use both a real and a synthetic version of the IBM HR attrition dataset. The synthetic version was generated using Gretel's Tabular DP model with an epsilon (ε) of 5.
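One way to stage those datasets, assuming pandas and two local CSVs (the filenames here are hypothetical placeholders for your downloaded real and synthetic files):

```python
import os
import sqlite3
import pandas as pd

def load_csv_to_sqlite(csv_path: str, table: str, db_path: str) -> int:
    """Load a CSV into a SQLite table, returning the number of rows loaded."""
    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
    return len(df)

# Hypothetical filenames; substitute the paths to your real and synthetic CSVs.
for csv_path, db_path in [
    ("attrition_real.csv", "attrition_real.db"),
    ("attrition_synthetic.csv", "attrition_synthetic.db"),
]:
    if os.path.exists(csv_path):
        rows = load_csv_to_sqlite(csv_path, "employees", db_path)
        print(f"Loaded {rows} rows from {csv_path} into {db_path}")
```

Keeping the real and synthetic data in two separate SQLite files makes it trivial to point an identical agent at each one and compare answers.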
Querying the data
First, we'll create a helper function to compare the outputs of real data and synthetic data.
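One way to write such a helper. It assumes you have built one agent per database as shown earlier; any objects exposing Langchain's `.run(question)` interface will work.

```python
def compare(real_agent, synthetic_agent, question: str) -> dict:
    """Run the same natural-language question against the real and synthetic data."""
    answers = {
        "real": real_agent.run(question),
        "synthetic": synthetic_agent.run(question),
    }
    for source, answer in answers.items():
        print(f"[{source}] {answer}")
    return answers
```

Returning the answers as a dict (rather than only printing them) makes it easy to diff or log the two responses side by side.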
Which three departments have the highest attrition rates?
The results were quite similar between the synthetic and real datasets, giving us confidence in the synthetic data's reliability.
What is the distribution of ages by 10-year increments across the entire dataset?
Again, the distributions were notably similar between the synthetic and real data sets.
Which department travels the furthest from home?
In this case, we get a perfect match.
Importance of privacy: Re-identification attack example
Here, we illustrate a "re-identification attack," in which an attacker re-identifies individuals in a supposedly de-identified dataset by combining known attributes. Such risks highlight the danger of sharing data that has been stripped of direct identifiers but still contains attributes that, in combination, can identify someone: in the example below, an attacker only needs to know a person's age, gender, and department.
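A minimal sketch of the attack with plain SQL against a toy, made-up table: three quasi-identifiers that are individually common can jointly isolate a single row, exposing that person's sensitive attributes.

```python
import sqlite3

# Build a toy "de-identified" table: no names, but quasi-identifiers remain.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (Age INT, Gender TEXT, Department TEXT, MonthlyIncome INT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (34, "Female", "Sales", 5200),
        (34, "Male", "Sales", 4800),
        (41, "Female", "Research & Development", 7100),
        (29, "Male", "Human Resources", 3900),
    ],
)

# An attacker who already knows a colleague's age, gender, and department
# can single out that record -- and learn the colleague's income.
matches = conn.execute(
    "SELECT MonthlyIncome FROM employees "
    "WHERE Age = 34 AND Gender = 'Female' AND Department = 'Sales'"
).fetchall()
print(matches)  # exactly one match: the target is re-identified
```

Against the synthetic table, the same query may still return rows, but no returned row corresponds to a real individual, so there is nothing to re-identify.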
Synthetic data prevents direct linking of individual information because no record in the output is derived from a single user's data, effectively thwarting re-identification attacks and upholding privacy.
By using synthetic data, you not only protect privacy but also gain actionable insights—essential for any data-driven organization. When you blend this with agent-based approaches and large language models, you open the door for more and better stakeholder collaborations. No SQL expertise needed; simply use natural language to engage with your data across all levels of your organization.
This scalable solution democratizes data access and ushers in a new era of smart, privacy-conscious data interaction. For businesses eager to maintain a competitive edge in today's data-centric world, adopting these technologies isn't just an option; it's a must.
Try synthesizing your own data with this notebook and let us know what you think.