Privacy-preserving AI development with Azure & Gretel

Leveraging Gretel's privacy-preserving synthetic data generation platform to fine-tune Azure OpenAI Service models in the financial domain.

The Challenge

Financial institutions possess valuable data that could power AI innovation, but regulatory requirements and privacy concerns keep this data locked in silos. These challenges include:

  • Limited ability to share insights across teams
  • Difficulty customizing AI models with sensitive financial data
  • Innovation constrained by lack of data access for experimentation

The Solution: A Privacy-Preserving AI Architecture

Below is a reference architecture that addresses these industry challenges.

Figure 1. Gretel's Privacy-preserving AI architecture for synthetic data generation.

1. Data Governance Layer

  • Business teams maintain control of sensitive data
  • Configure privacy policies and access controls
  • Assign unique identifiers to data types (e.g., "10K_RISK_FACTORS") that can be used to generate synthetic data on demand
  • Enable access to their teams or share across the organization
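As an illustrative sketch (not Gretel's actual API), the governance layer above can be thought of as a registry that maps data-type identifiers to privacy policies and team-level access controls. All names below are hypothetical:

```python
# Hypothetical sketch of a data-governance registry: maps data-type
# identifiers (e.g. "10K_RISK_FACTORS") to privacy policies and
# team-level access controls. Illustrative only, not a Gretel API.
from dataclasses import dataclass, field

@dataclass
class DataTypePolicy:
    identifier: str                 # e.g. "10K_RISK_FACTORS"
    source: str                     # sensitive source dataset
    dp_epsilon: float               # differential-privacy budget
    allowed_teams: set = field(default_factory=set)

registry = {}

def register(policy: DataTypePolicy) -> None:
    """Business teams register a data type they control."""
    registry[policy.identifier] = policy

def can_access(identifier: str, team: str) -> bool:
    """Check whether a team may request synthetic data for this identifier."""
    policy = registry.get(identifier)
    return policy is not None and team in policy.allowed_teams

register(DataTypePolicy(
    identifier="10K_RISK_FACTORS",
    source="sensitive-labeled-sec-filings.csv",
    dp_epsilon=8.0,
    allowed_teams={"risk-analytics", "ai-platform"},
))
```

With such a registry in place, a request for synthetic "10K_RISK_FACTORS" data can be checked against team permissions before any generation job is launched.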

2. Privacy-Preserving Synthetic Data Generation Layer

Powered by Gretel's synthetic data platform:

  • Train synthetic data models with differential privacy guarantees
  • Generate high-quality synthetic data by:
    • Fine-tuning on your sensitive data
    • Using pre-trained models with just a prompt
  • Support for many data types (tabular, text, JSON, mixed formats)
  • Automated quality and privacy testing and reports
  • Enable secure data sharing across teams

3. AI Development Layer

Built on Azure OpenAI Service:

  • Fine-tune models on synthetic data
  • Deploy standardized analysis formats
  • Maintain consistent evaluation metrics

From Concept to Reality

“EY is leveraging the privacy-protected synthetic data to fine-tune Azure OpenAI Service models in the financial domain,” said John Thompson, Global Client Technology AI Lead at EY. “Using this technology with differential privacy guarantees, we generate highly accurate synthetic datasets—within 1% of real data accuracy—that safeguard sensitive financial information and prevent PII exposure. This approach ensures model safety through privacy attack simulations and robust data quality reporting. With this integration, we can safely fine-tune models for our specific financial use cases while upholding the highest compliance and regulatory standards.”

Example: Building a Financial Risk Analysis Copilot

In this example, we will create a tool for financial analysts by combining Gretel's differential privacy capabilities with Azure OpenAI Service language models. This copilot serves as a blueprint to help analysts quickly spot and label risks in regulatory filings, drawing its intelligence from synthetic training data that preserves privacy while capturing essential financial insights.

Our approach focuses on recent regulatory filings from 2023-2024. While these filings are publicly available, they contain specialized financial language and are too recent to be included in standard LLM training data. These filings are therefore perfect for demonstrating how synthetic data can capture domain-specific nuances while maintaining privacy. Better yet, we can shape the synthetic data distribution through simple prompts, eliminating the need for expensive manual curation or expert labeling services.

Let’s Get Started

Creating this privacy-preserving synthetic text is straightforward with Gretel's GPT API. The process begins with a synthetic text generation job, named "phi3-finance-risk" in the configuration below. This job fine-tunes a Microsoft Phi-3 small language model to become our expert synthetic data generator, incorporating fine-tuning parameters and differential privacy controls to ensure the resulting data is both useful and secure.

schema_version: "1.0"
name: phi3-finance-risk
models:
  - gpt_x:
      data_source:
        - sensitive-labeled-sec-filings.csv
      ref_data: {}
      pretrained_model: gretelai/Phi-3-mini-128k-instruct
      params:
        batch_size: 8
        epochs: 5
        steps: null
        weight_decay: 0.01
        warmup_steps: 100
        lr_scheduler: linear
        learning_rate: 0.008723427557099954
        max_tokens: 784
        gradient_accumulation_steps: 32
      peft_params:
        lora_r: 8
        lora_alpha_over_r: 1
        target_modules:
          - q_proj
          - v_proj
          - k_proj
          - o_proj
          - gate_proj
          - down_proj
          - up_proj
      privacy_params:
        dp: true
        epsilon: 8
        delta: auto
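A couple of quick sanity checks help interpret these hyperparameters, for example the effective batch size implied by gradient accumulation. The reading of `lora_alpha_over_r` as a ratio of LoRA alpha to rank is our assumption from the parameter name:

```python
# Sanity-checking a few hyperparameters from the config above.
batch_size = 8
gradient_accumulation_steps = 32

# Effective batch size seen by each optimizer step: gradients from 32
# micro-batches of 8 rows are accumulated before each update.
effective_batch = batch_size * gradient_accumulation_steps

# LoRA alpha expressed relative to rank r (assumed reading of
# lora_alpha_over_r): alpha = r * (alpha / r).
lora_r = 8
lora_alpha_over_r = 1
lora_alpha = lora_r * lora_alpha_over_r

print(effective_batch, lora_alpha)
```

Larger effective batches are a common choice under differential privacy, since per-example gradient clipping and noise addition benefit from averaging over more samples per step.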

The following is example synthetically generated training data; the full dataset is linked in the Resources section below.

Input Data

Below is an example of a synthetically generated filing excerpt created using Gretel. The excerpt demonstrates how adding document and section identifiers to training data enables targeted synthetic data generation.

For example, prepending identifiers like ||10-k||Item 7|| ensures the model generates content tailored to a specific document type and section. This approach not only improves contextual relevance but also allows for precise filtering during model inference.

Pro Tip: During training in Gretel, tagging documents with relevant metadata (e.g., document type, section, or content focus) enhances model performance and facilitates easier data retrieval and generation. This labeling strategy enables users to request synthetic data for specific document types and sections, such as risk factors in a 10-K filing.

The training excerpts were chunked into manageable sizes, typically around 10,000 characters, to optimize model training while maintaining coherence. For the purposes of this blog, the excerpt below has been abbreviated for readability.

{
  "input": "||10-k||Item 7||As of December 31, 2023, the company has $3.5 billion in outstanding term loans at a fixed rate of 5.1%, with maturities in 2026 and 2028. An additional $2.0 billion is outstanding on its revolving credit facility at a variable rate of LIBOR + 2.5% (current effective rate: 6.3%) [removed for brevity] The company has $1.2 billion cash on hand and a debt-to-equity ratio of 1.8. Hedge coverage stands at 30%, and upcoming debt service requirements exceed $500 million annually."
}
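The tagging and chunking steps described above can be sketched as follows; `make_training_record` and `chunk_text` are illustrative helper names, not part of the Gretel SDK:

```python
# Illustrative helpers for preparing tagged, chunked training records.
# These function names are hypothetical, not part of the Gretel SDK.

def make_training_record(doc_type: str, section: str, text: str) -> str:
    """Prepend document-type and section tags, e.g. ||10-k||Item 7||..."""
    return f"||{doc_type}||{section}||{text}"

def chunk_text(text: str, max_chars: int = 10_000) -> list:
    """Split a long excerpt into chunks of at most max_chars characters.

    A production pipeline would split on sentence or paragraph
    boundaries to preserve coherence; this naive slice shows the idea.
    """
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

record = make_training_record(
    "10-k", "Item 7", "As of December 31, 2023, the company has ..."
)
chunks = chunk_text("x" * 25_000)  # 25k chars -> chunks of 10k, 10k, 5k
```

Each chunk then becomes one `input` row in the fine-tuning dataset, carrying its tags so the model learns to condition generation on document type and section.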

Model Output

Below is the model-generated summary based on the input excerpt. This structured output demonstrates the model’s ability to extract, quantify, and summarize financial risks from the input text. The output conforms to predefined schemas for structured data representation, making it actionable for downstream analysis.

{
  "risk_severity": "HIGH",
  "risk_categories": ["DEBT", "INTEREST_RATE", "LIQUIDITY"],
  "financial_impact": {
    "amount": 5500.0,
    "timeframe": "2026",
    "recurring": true
  },
  "key_metrics": {
    "debt_outstanding": 5500.0,
    "interest_rate": 6.3,
    "hedge_ratio": 0.3
  },
  "critical_dates": ["2026-12-31", "2028-12-31"],
  "analysis": "The company faces $5.5B in outstanding debt, with significant exposure to rising rates due to 70% variable-rate debt. Hedge coverage is limited to 30%, and annual debt service exceeds $500M, posing liquidity challenges."
}
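The full output schema isn't reproduced in this post, but its shape can be sketched with a stdlib dataclass whose fields mirror the JSON keys in the example above:

```python
# Sketch of the structured risk-assessment schema implied by the
# example output above. Field names mirror the JSON keys; several
# fields may be null in model output, hence Optional.
import json
from dataclasses import dataclass
from typing import Optional, List, Dict

@dataclass
class RiskAssessment:
    risk_severity: str                  # e.g. "HIGH", "MEDIUM", "NONE"
    risk_categories: List[str]          # e.g. ["DEBT", "INTEREST_RATE"]
    financial_impact: Optional[Dict]    # amount, timeframe, recurring
    key_metrics: Optional[Dict]         # debt_outstanding, interest_rate, ...
    critical_dates: Optional[List[str]]
    analysis: str

raw = '''{
  "risk_severity": "HIGH",
  "risk_categories": ["DEBT", "INTEREST_RATE", "LIQUIDITY"],
  "financial_impact": {"amount": 5500.0, "timeframe": "2026", "recurring": true},
  "key_metrics": {"debt_outstanding": 5500.0, "interest_rate": 6.3, "hedge_ratio": 0.3},
  "critical_dates": ["2026-12-31", "2028-12-31"],
  "analysis": "The company faces $5.5B in outstanding debt..."
}'''

# Parsing the model output into the dataclass acts as a lightweight
# schema check before passing results to downstream analysis.
assessment = RiskAssessment(**json.loads(raw))
```

Validating outputs against a schema like this is what makes the generated analyses machine-actionable rather than free-form text.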

Key Takeaways

  • Tagged Input Structure: Tags like ||10-k||Item 7|| make it easy to prompt the model for section-specific synthetic content. This is particularly useful for generating targeted data for filings like 10-Ks, 10-Qs, or specific sections such as MD&A or Risk Factors.
  • Quantitative Outputs: The output highlights key financial risks with precise metrics such as debt amounts, interest rates, hedge ratios, and critical dates, making it actionable for financial analysts.
  • Actionable Summaries: The generated analysis provides concise insights into risks, including exposures, challenges, and potential mitigation strategies.
  • Differential Privacy: Since the data is synthetically generated, it retains utility for analysis without compromising sensitive information from the original dataset.

This methodology not only enhances data usability but also ensures compliance with privacy regulations, making it a powerful tool for generating risk analysis training datasets.

Azure OpenAI Service Fine-Tuning Implementation

With our synthetic dataset prepared, we'll demonstrate how to fine-tune an Azure OpenAI Service model for financial risk analysis. The Gretel SDK streamlines this process by handling the integration between Gretel's synthetic data and the Azure OpenAI Service fine-tuning API.

For this implementation, we'll use GPT-4o-mini, an efficient model variant optimized for processing financial documents at scale. Let's walk through the implementation:

1. Setup Requirements

# Required environment variables
AZURE_OPENAI_ENDPOINT = "https://<your-instance>.openai.azure.com/" 
AZURE_OPENAI_API_KEY = "<your-api-key>"                            
API_VERSION = "2024-08-01-preview"

2. Fine-Tuning Implementation

import os

from openai import AzureOpenAI
from gretel_client.fine_tuning import OpenAIFormatter, OpenAIFineTuner
from datasets import load_dataset

# Initialize Azure OpenAI client
azure_client = AzureOpenAI(
    api_version="2024-08-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY")
)

# Load financial risk analysis dataset
dataset = load_dataset("gretelai/gretel-financial-risk-analysis-v1")
train_df = dataset["train"].to_pandas()
validation_df = dataset["test"].to_pandas()

# Define the system prompt for financial risk analysis
SYSTEM_MESSAGE = """You are an expert financial risk analyst. Analyze the provided text for financial risks, and output a structured assessment in JSON format including risk detection, specific risk flags, financial exposure details, and analysis notes."""

# Create formatter for data preparation
formatter = OpenAIFormatter(
    system_message=SYSTEM_MESSAGE,
    user_template="{input}",
    assistant_template="{output}"
)

# Initialize fine-tuning manager
fine_tuner = OpenAIFineTuner(
    openai_client=azure_client,
    formatter=formatter,
    train_data=train_df,
    validation_data=validation_df
)

# Prepare data and start fine-tuning
fine_tuner.prepare_and_upload_data()
fine_tuner.start_fine_tuning(
    model="gpt-4o-mini-2024-07-18",
    epochs=1
)

# Wait for completion and save results
fine_tuner.wait_for_fine_tune_job()

What's Happening Under the Hood

The Gretel SDK automates several complex processes:

  • Data Processing: Converts your synthetic training data into an Azure OpenAI-friendly format
  • File Management: Natively handles upload and validation of training files to the Azure OpenAI Service
  • Training Orchestration: Manages the fine-tuning job lifecycle
  • Progress Monitoring: Tracks training metrics and status

This deployment makes your fine-tuned model available for inference through the Azure OpenAI Service API.
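Once deployed, the model can be queried through the standard chat completions API. Below is a sketch; `financial-risk-analyst` is a hypothetical deployment name, and the API call only runs when credentials are actually configured:

```python
# Sketch of inference against the fine-tuned deployment.
# "financial-risk-analyst" is a hypothetical deployment name.
import os

SYSTEM_MESSAGE = (
    "You are an expert financial risk analyst. Analyze the provided text "
    "for financial risks, and output a structured assessment in JSON format."
)

def build_messages(filing_excerpt: str) -> list:
    """Assemble the chat payload matching the fine-tuning format."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": filing_excerpt},
    ]

messages = build_messages(
    "The Company has entered into a five-year contract to purchase ..."
)

# Only call the API when Azure credentials are configured.
if os.getenv("AZURE_OPENAI_API_KEY") and os.getenv("AZURE_OPENAI_ENDPOINT"):
    from openai import AzureOpenAI
    client = AzureOpenAI(
        api_version="2024-08-01-preview",
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
    )
    response = client.chat.completions.create(
        model="financial-risk-analyst",  # your deployment name
        messages=messages,
    )
    print(response.choices[0].message.content)
```

Note that in Azure OpenAI the `model` argument refers to your deployment name rather than the base model identifier.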

Results

We fine-tuned the model for three epochs, although the training and validation loss curves suggest that a single epoch would likely have been sufficient.

Chart 1. Training & Validation Loss/Accuracy over Steps.

After deploying the model to Azure, we can run a few simple test cases.

Test Case 1:
Input: The Company has entered into a five-year contract to purchase raw materials from a single supplier in a volatile market. The contract requires minimum purchases of $10M annually with no cancellation clause. Recent market analysis suggests potential price fluctuations of up to 40% in the next year.

Analysis:
{
  "analysis": "No hedge against raw material price fluctuations; potential 40% increase in costs",
  "critical_dates": null,
  "financial_impact": {
    "amount": 4.0,
    "recurring": true,
    "timeframe": "2022"
  },
  "key_metrics": null,
  "risk_categories": [
    "OPERATIONAL",
    "MARKET"
  ],
  "risk_severity": "HIGH"
}
--------------------------------------------------------------------------------

Test Case 2:
Input: Company XYZ announced a major expansion into emerging markets, requiring $50M in upfront capital expenditure. The project will be funded through a combination of variable-rate loans (60%) and existing cash reserves. Market analysts expect interest rates to rise by 2% over the next year.

Analysis:
{
  "analysis": "Variable-rate loans expose Company XYZ to rising interest rates, potentially increasing debt servicing costs by $1M per year",
  "critical_dates": null,
  "financial_impact": {
    "amount": 1.0,
    "recurring": true,
    "timeframe": "YEAR"
  },
  "key_metrics": {
    "debt_outstanding": 30.0,
    "hedge_ratio": null,
    "interest_rate": 3.0,
    "tax_exposure": null
  },
  "risk_categories": [
    "INTEREST_RATE",
    "DEBT"
  ],
  "risk_severity": "MEDIUM"
}
--------------------------------------------------------------------------------

Test Case 3:
Input: The company has successfully completed its annual employee satisfaction survey with a 95% participation rate. Results show high employee engagement scores across all departments. The HR department is planning to implement new professional development programs next quarter, which will be covered by the existing training budget.

Analysis:
{
  "analysis": "No material financial risks identified.",
  "critical_dates": null,
  "financial_impact": null,
  "key_metrics": null,
  "risk_categories": [],
  "risk_severity": "NONE"
}
--------------------------------------------------------------------------------

Compared with examples generated by human experts and with in-context learning approaches, the fine-tuned model delivered significant improvements in speed, token efficiency, and accuracy:

  • 93% alignment with expert assessments
  • 70% reduction in required prompt tokens
  • 10x faster risk analysis
  • Organization-wide secure data access

Resources

To implement this pipeline in your organization:

  1. Access our synthetic financial dataset on HuggingFace at: https://huggingface.co/datasets/gretelai/gretel-financial-risk-analysis-v1 
  2. Full notebook walk-through to fine-tune and deploy an Azure OpenAI Service model on the dataset above: https://gist.github.com/zredlined/c42d71d7c94078e0ff9b864c1dd6ec24 

Conclusion

This privacy-preserving AI architecture demonstrates that organizations no longer need to choose between data privacy and AI innovation. By combining differential privacy, synthetic data, and modern AI capabilities, financial institutions can now safely democratize their data access while maintaining strict compliance standards. The resources we've provided offer a starting point for organizations ready to transform how they leverage their sensitive data for AI development. Learn more about the Azure OpenAI Services-Gretel collaboration, get started with Gretel, or contact us.