Privacy-First Chatbot Enhancement in Finance with Databricks and Gretel

Enhance finance chatbots with privacy-first synthetic data using Gretel and Databricks to boost performance while ensuring compliance with privacy regulations.

In the fast-paced world of financial services, customer support is a critical differentiator. However, the sensitive nature of financial data presents a unique challenge: how can we leverage customer interaction data to improve support without compromising privacy? This blog post demonstrates a powerful solution: using Gretel's synthetic data capabilities within the Databricks Data Intelligence Platform to enhance customer support chatbots through a Retrieval-Augmented Generation (RAG) workflow, all while ensuring compliance with privacy regulations.

The Challenge: Balancing Innovation and Privacy

Financial institutions sit on a goldmine of customer interaction data that could dramatically improve their support systems. These organizations are constantly challenged with finding new ways to leverage AI and machine learning to enhance customer experience without compromising privacy. Chatbots, commonly used in customer support, present an opportunity for such improvement.

RAG workflows can significantly boost chatbot performance, leading to higher-quality responses, reduced resolution times, and increased customer satisfaction. RAG achieves this by giving chatbots access to historical and contextual information that augments the AI models they use.

However, implementing RAG in financial institutions faces a significant hurdle: the customer interaction data needed to power these improvements often contains sensitive information. Using this data directly in machine learning models or sharing it with third-party vendors poses substantial privacy risks, preventing its use in AI/ML applications.

This is where synthetic data offers a powerful solution. By creating purpose-built datasets that maintain the statistical properties of the original data without exposing sensitive information, synthetic data unlocks new possibilities. It allows financial institutions to provide contextual information to their chatbots and leverage RAG workflows safely, effectively improving customer experience while maintaining strict privacy standards.

Our Solution: Gretel + Databricks

Gretel's Navigator Fine Tuning is a capability that allows customers to synthesize private, safe-to-use datasets based on existing data. It is purpose-built for generating synthetic tabular datasets and excels at handling mixed data types, including numerical, categorical, free text, JSON, and time-series values.

One of the key benefits of using Navigator Fine Tuning for customer call records is its ability to reduce the chance of replaying values from the original dataset, especially rare values. This is crucial for maintaining privacy and preventing the exposure of sensitive customer information in the synthetic data, making Navigator Fine Tuning an excellent choice for creating privacy-preserving, realistic synthetic datasets that can be safely used in RAG applications.

To do this, we'll walk through a process that combines Gretel's advanced synthetic data generation with Databricks' powerful data processing capabilities.

In a real-world scenario, a financial institution would typically start with their own private customer interaction data. This data would be sensitive and subject to strict privacy regulations. For the purposes of this blog post and to demonstrate the process without using actual private data, we'll use Gretel Navigator to create a fake dataset that mimics real customer interaction data. This approach allows us to showcase the workflow without compromising any real customer information.

Then we will use Navigator Fine Tuning to generate a private and safe version of that dataset, which we will store in the Databricks File System, and finally we will use the synthetic dataset in a RAG workflow.

Generating a Dataset

We'll use Gretel Navigator to generate a synthetic dataset based on the following schema and store it in a Databricks table.

1. Customer ID: A unique identifier for each customer interaction

2. Timestamp: The date and time of the customer interaction

3. Interaction Type: Type of interaction (e.g., chat, email, phone)

4. Customer Query: The text of the customer's question or request

5. Intent: The identified intent of the customer query

6. Response: The response provided to the customer

7. Resolution Status: Whether the query was resolved successfully

8. Agent Notes: Additional notes from the support agent

9. Sentiment: The sentiment score of the interaction
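For reference, here is a minimal sketch of what a matching PySpark schema for the Databricks table could look like; the field types and nullability choices are illustrative assumptions, not a prescribed layout:

from pyspark.sql.types import (
    StructType, StructField, IntegerType, TimestampType, StringType
)

# Illustrative schema for the customer interaction table described above
interaction_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("interaction_type", StringType(), True),   # e.g., chat, email, phone
    StructField("customer_query", StringType(), True),
    StructField("intent", StringType(), True),
    StructField("response", StringType(), True),
    StructField("resolution_status", StringType(), True),  # e.g., resolved, unresolved
    StructField("agent_notes", StringType(), True),
    StructField("sentiment", StringType(), True),          # e.g., positive, neutral, negative
])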

You can create this data using Gretel Navigator through the Gretel Console or Python SDK. 

To use the Console, launch Gretel, click “Navigator” in the left navigation bar, and paste in the prompt shown in the code snippet below.

Here's a Python snippet demonstrating how to use Gretel's Python SDK to generate this synthetic dataset:

from gretel_client import Gretel

# Connect to Gretel; api_key="prompt" will ask for your API key if one
# is not already configured on this machine
gretel = Gretel(api_key="prompt")

# Initialize the Navigator tabular inference API
tabular = gretel.factories.initialize_navigator_api("tabular", backend_model="gretelai/auto")

prompt = """\
Generate customer support chatbot data:
customer_id: A unique numeric identifier for each customer interaction (e.g., 1234).
timestamp: The date and time of the customer interaction (e.g., 2024-03-15 10:01:17).
interaction_type: Type of interaction (e.g., chat, email, phone).
customer_query: The text of the customer's question or request.
intent: The identified intent of the customer query (e.g., account balance inquiry, transaction dispute).
response: The response provided to the customer.
resolution_status: Whether the query was resolved successfully.
agent_notes: Additional notes from the support agent.
sentiment: The sentiment score of the interaction (e.g., positive, neutral, negative).
"""

# Generate tabular data from the natural language prompt
df = tabular.generate(prompt, num_records=150)

This process will generate a dataset that resembles real customer interaction data but is entirely synthetic. It will contain realistic-looking customer queries, responses, and other fields, but none of the data will correspond to real individuals or interactions. Once the data is generated, you can save it as a Delta table in the Databricks File System (DBFS).
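For example, from a Databricks notebook you can persist the generated pandas DataFrame along these lines; this is a minimal sketch that assumes the notebook's built-in spark session, and the table name is an illustrative placeholder:

# Convert the pandas DataFrame returned by Navigator into a Spark DataFrame
spark_df = spark.createDataFrame(df)

# Persist it as a Delta table (the table name is an illustrative placeholder)
spark_df.write.format("delta").mode("overwrite").saveAsTable("customer_interactions_seed")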

Synthesizing a Private and Safe Dataset

We'll use a Databricks notebook to create a Gretel Workflow that connects to the dataset we produced, synthesizes a private version of it, and writes it back to DBFS. You can set this up through the Gretel Console or Python SDK. The following steps show how you can use the Console for this.

In this Workflow we provide a data source, a data destination, and configure a Navigator Fine Tuning model to generate our synthetic dataset.
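If you prefer to stay entirely in code, Gretel's high-level Python SDK can run the same train-and-generate step. The sketch below is a hedged outline: it assumes "navigator-ft" is the base config name for Navigator Fine Tuning and that df is the seed DataFrame from earlier, so check Gretel's documentation for the exact names. The equivalent Console steps follow.

from gretel_client import Gretel

gretel = Gretel(api_key="prompt")

# Train a Navigator Fine Tuning model on the seed dataset
# ("navigator-ft" is assumed to be the base config name; verify in Gretel's docs)
trained = gretel.submit_train("navigator-ft", data_source=df)

# Generate a fresh, privacy-preserving synthetic dataset from the trained model
generated = gretel.submit_generate(model_id=trained.model_id, num_records=500)
synthetic_df = generated.synthetic_data  # pandas DataFrame, ready to write back to Databricks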

Step 1: Create a new Workflow

Step 2: Create a new Project or use an Existing Project

Step 3: Select a Model

Step 4: Connect to your data source

Here we use a Databricks connector. You can create a connector to your Databricks environment by following these instructions.

Step 5: Select your output data destination

Step 6: Set up a schedule

You can run your Workflows immediately or set up a recurring schedule. As new customer interaction data is produced, you can schedule Workflows to create synthetic versions of the new data on a regular cadence.

Step 7: Run the Workflow

Step 8: Review results

Gretel provides a Synthetic Quality Score (SQS) to help you analyze the dataset produced.

You can improve the data privacy score by increasing the amount of training data you use or by underfitting the model through early stopping. You can find more details about SQS in our documentation.

Now that we have our private synthetic dataset stored in Databricks, we are ready to use it for our RAG application.

Implementing RAG with Synthetic Data

With our privacy-safe synthetic dataset in hand, we can now implement a RAG workflow to enhance our chatbot. This process involves:

1. Indexing the synthetic customer interactions

2. Implementing a retrieval mechanism to find relevant past interactions

3. Using these retrieved interactions to augment the input to a large language model

This approach allows the chatbot to leverage "experiences" from past interactions without risking exposure of real customer data. This tutorial from Databricks explains how you can set up a RAG workflow.
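To make the retrieval step concrete, here is a minimal illustrative sketch rather than the tutorial's implementation: it assumes the synthetic data lives in a Delta table named synthetic_customer_interactions, that it runs in a Databricks notebook where spark is available, and it uses the open-source sentence-transformers library for embeddings.

import numpy as np
from sentence_transformers import SentenceTransformer

# Load the synthetic interactions (the table name is an illustrative placeholder)
interactions = spark.read.table("synthetic_customer_interactions").toPandas()

# Index: embed every synthetic customer query once
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(
    interactions["customer_query"].tolist(), normalize_embeddings=True
)

def retrieve(query: str, k: int = 3):
    """Return the k most similar past interactions for a new query."""
    query_embedding = model.encode([query], normalize_embeddings=True)
    scores = (corpus_embeddings @ query_embedding.T).ravel()  # cosine similarity
    top_idx = np.argsort(scores)[::-1][:k]
    return interactions.iloc[top_idx]

# Augment: build an LLM prompt that includes the retrieved context
question = "Why was my card declined at the ATM?"
examples = retrieve(question)
context = "\n\n".join(
    f"Q: {row.customer_query}\nA: {row.response}" for row in examples.itertuples()
)
augmented_prompt = f"Relevant past interactions:\n{context}\n\nCustomer: {question}"

The augmented prompt can then be passed to whichever LLM serves the chatbot, giving it grounded, privacy-safe context.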

The following GitHub repo contains two notebooks that you can use to follow the example in this blog and to try the workflow with your own dataset.

Conclusion

The key benefit unlocked by Gretel is the ability to leverage sensitive customer data safely, which was previously unusable for AI system improvements due to privacy concerns. This innovative approach enhances chatbot performance through:

  • Enhanced contextual understanding: By using RAG with synthetic data that mirrors real customer interactions, chatbots can provide more relevant and accurate responses, significantly improving the customer experience compared to systems without access to such data.
  • Privacy-compliant continuous learning: As new customer interactions occur, synthetic datasets can be updated safely, allowing the chatbot to evolve and improve constantly without exposing real customer data or requiring frequent full model retraining.
  • Personalized financial problem-solving: Leveraging synthetic versions of unique customer histories enables chatbots to offer tailored solutions for complex financial queries, leading to faster resolution times and increased customer satisfaction while maintaining strict data privacy.

Blog Authors

Prasad Kona, SA, Databricks

Manjesh Mogallapalli, SWE, Gretel

Kirit Thadaka, PM, Gretel