Gretel Unlocks PII Detection with Synthetic Financial Document Dataset

Gretel releases a new synthetic financial document dataset to empower AI developers in building customized and highly performant sensitive data detection systems.

Introduction

Detecting and handling personally identifiable information (PII) in domain-specific documents is no easy task. If you’re an AI developer or organization working with sensitive data, you face a myriad of complex regulatory standards and security compliance issues, across countries and industries. For example, certain customer identifiers are handled differently under the EU’s GDPR versus U.S. financial data protection laws.

To help developers build customized, high-performing detection systems that can process these various forms of confidential information, Gretel has released a new multilingual synthetic financial dataset on Hugging Face 🤗 for AI training. The dataset is designed to improve the performance of Named Entity Recognition (NER) models and to validate PII scanning systems across various industries and languages.

Empowering PII Detection with Domain-Specific Datasets 🛡️

Every organization has unique data formats and domain-specific requirements that may not be fully captured by existing Named Entity Recognition (NER) models or sample datasets. With Gretel Navigator, you can create customized synthetic datasets tailored to your specific needs in a fraction of the time and cost associated with traditional manual labeling techniques. These datasets can be large-scale, diverse, and privacy-preserving while still reflecting the characteristics and challenges of your domain, so that your PII detection models are prepared for real-world scenarios, edge cases, and the document and PII types unique to your organization.

Using the Dataset for PII Detection

This dataset is designed to assist with the following use cases:

  1. 🏷️ Training NER models to detect and label PII in different domains.
  2. 🔍 Testing PII scanning systems on realistic, full-length documents unique to different domains.
  3. 📊 Evaluating the performance of de-identification systems on realistic documents containing PII.
  4. 🔒 Developing and testing data privacy solutions for the financial industry.
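
For the testing and evaluation use cases above, one common approach is to compare the spans a scanner predicts against the labeled spans and report precision, recall, and F1. The following is a minimal sketch, not Gretel's evaluation method: the `Span` type, the `span_scores` helper, and the exact-match criterion are all assumptions chosen for illustration.

```python
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity
    label: str  # PII type, e.g. "email"


def span_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> dict:
    """Exact-match precision, recall, and F1 over labeled spans."""
    gold_set, pred_set = set(gold), set(predicted)
    # A prediction counts only if start, end, and label all match a gold span.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold labels and scanner output for one document.
gold = [Span(0, 8, "name"), Span(20, 36, "email"), Span(50, 60, "iban")]
predicted = [Span(0, 8, "name"), Span(20, 36, "phone_number")]
scores = span_scores(gold, predicted)
```

Exact matching is strict: a span with the right offsets but the wrong label counts as both a false positive and a false negative. A more lenient variant could score partial overlaps instead.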

About Gretel's Synthetic Financial Document Dataset

The gretelai/synthetic_pii_finance_multilingual dataset is the first in a series of dataset releases aimed at tackling PII detection challenges across different domains, including finance, health, retail, and IT security. The dataset contains synthetically generated, full-length financial documents with labeled PII examples.

🔑 Key Features

  • 55,940 records partitioned into 50,776 training samples and 5,164 test samples
  • Coverage across 100 distinct financial document formats, with 20 specific subtypes for each format – everything from customer support logs to insurance policies and AML compliance audits
  • Synthetic PII with 29 distinct PII types, aligned with Python Faker library generators for easy detection and replacement
  • Full-length documents with an average length of 1,357 characters
  • Multilingual support for English, Spanish, Swedish, German, Italian, Dutch, and French
Figure 1. A graph of the types of PII data included in the synthetic dataset.
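
Because the PII types line up with Faker generators, replacing a detected value can be as simple as looking up a generator by label. The sketch below illustrates the idea with stand-in generators so that it runs without extra dependencies. The `replace_pii` helper, the span dictionary keys (`start`, `end`, `label`), and the sample text are assumptions for illustration, not the dataset's documented schema.

```python
from typing import Callable, Dict, List


def replace_pii(
    text: str,
    spans: List[dict],
    generators: Dict[str, Callable[[], str]],
) -> str:
    """Replace each labeled span with a freshly generated surrogate value."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        make_value = generators.get(span["label"])
        if make_value is None:
            continue  # leave labels without a generator untouched
        text = text[: span["start"]] + make_value() + text[span["end"]:]
    return text


# Stand-in generators (hypothetical). With Faker installed, these could be
# swapped for bound methods such as Faker().name and Faker().email.
generators = {
    "name": lambda: "Jane Doe",
    "email": lambda: "jane.doe@example.com",
}

text = "Contact Maria Lopez at m.lopez@bank.example today."
spans = [
    {"start": 8, "end": 19, "label": "name"},
    {"start": 23, "end": 43, "label": "email"},
]
redacted = replace_pii(text, spans, generators)
```

Processing spans from the end of the document backward avoids recomputing offsets when a replacement is longer or shorter than the original value.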

🔍 Synthetic Data Quality Assessment 

The LLM-as-a-Judge technique using the Mistral-7B language model was employed to ensure the quality of the synthetic PII and documents in this dataset. Each generated record was evaluated based on the following criteria:

  • Conformance Score: A score from 0-100 indicating the conformance of the generated text to the provided tags and descriptions, with 100 being fully conforming and 0 being non-conforming.
  • Quality Score: A score from 0-100 based on the grammatical correctness, coherence, and relevance of the generated text, with 100 being the highest quality and 0 being the lowest quality.
  • Toxicity Score: A score from 0-100 indicating the level of toxic content in the generated text, with 0 being non-toxic and 100 being highly toxic.
  • Bias Score: A score from 0-100 indicating the level of unintended biases in the generated text, with 0 being unbiased and 100 being heavily biased.
  • Groundedness Score: A score from 0-100 indicating the level of factual correctness in the generated text, with 100 being fully grounded in facts and 0 being completely ungrounded.

Records with a toxicity score or bias score above 20, or a groundedness score, quality score, or conformance score below 80, were removed from the dataset.
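
The removal rule above can be written as a single predicate. This is a minimal sketch of the stated thresholds only. The field names (`toxicity_score`, `bias_score`, and so on) are assumed for illustration and may not match the column names in the published dataset.

```python
MAX_HARM_SCORE = 20      # toxicity and bias must not exceed this
MIN_QUALITY_SCORE = 80   # groundedness, quality, conformance must reach this


def passes_quality_filter(record: dict) -> bool:
    """Return True if a judged record would be kept under the stated rule."""
    if record["toxicity_score"] > MAX_HARM_SCORE:
        return False
    if record["bias_score"] > MAX_HARM_SCORE:
        return False
    return all(
        record[name] >= MIN_QUALITY_SCORE
        for name in ("groundedness_score", "quality_score", "conformance_score")
    )


# Hypothetical judged records.
records = [
    {"toxicity_score": 0, "bias_score": 5, "groundedness_score": 95,
     "quality_score": 90, "conformance_score": 85},
    {"toxicity_score": 25, "bias_score": 0, "groundedness_score": 95,
     "quality_score": 90, "conformance_score": 85},   # too toxic
    {"toxicity_score": 0, "bias_score": 0, "groundedness_score": 95,
     "quality_score": 79, "conformance_score": 85},   # quality too low
]
kept = [r for r in records if passes_quality_filter(r)]
```

Scores exactly at a threshold (20 or 80) are kept, which matches the wording "above 20" and "below 80".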

Figure 2. The distribution of scores across data quality and PII evaluation metrics.

To get started with the gretelai/synthetic_pii_finance_multilingual dataset, you can easily load it using the Hugging Face datasets library and query by language to create a DataFrame:

from datasets import load_dataset
import pandas as pd

# Load the dataset from Hugging Face
dataset = load_dataset("gretelai/synthetic_pii_finance_multilingual")

# Create a DataFrame for a specific language (e.g., English)
df = pd.DataFrame(dataset["train"]).query("language == 'English'")
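
Once records are loaded, character-level PII spans typically need to be converted to token-level tags before they can be used to train an NER model. The following sketch shows one way to do that with whitespace tokenization and BIO tags. It assumes the spans have already been parsed into a list of dictionaries with `start`, `end`, and `label` keys. Those key names and the `bio_tags` helper are hypothetical, so check the dataset card for the actual span format.

```python
import re
from typing import List, Tuple


def bio_tags(text: str, spans: List[dict]) -> List[Tuple[str, str]]:
    """Tokenize on whitespace and assign a BIO tag to each token."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            # A token belongs to a span when their character ranges overlap.
            if match.start() < span["end"] and match.end() > span["start"]:
                # The first token of a span gets "B", later tokens get "I".
                prefix = "B" if match.start() <= span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tagged.append((match.group(), tag))
    return tagged


# Hypothetical document text and span annotation.
text = "Wire funds to Maria Lopez by Friday."
spans = [{"start": 14, "end": 25, "label": "name"}]
tags = bio_tags(text, spans)
```

A production pipeline would usually align spans with the subword tokenizer of the model being fine-tuned rather than with whitespace tokens.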

For a more detailed exploration of the dataset and code examples, check out our Jupyter Notebook. 

Upcoming: Fine-Tuning Customized NER Models 

Stay tuned for our upcoming blog post, where we will dive into how you can use the gretelai/synthetic_pii_finance_multilingual dataset to fine-tune your own customized NER model for scanning PII and even replacing it with synthetic examples. The post will provide a step-by-step guide to using this dataset to strengthen your PII detection capabilities and support compliance with data privacy regulations.

Gretel's Transform API

For those looking to detect PII across text and tabular formats without custom model training, Gretel offers the Transform API. This service provides a simple yet powerful way to identify and protect sensitive data in your documents. Learn more on our blog.

Supporting the Open Data Community 🤝

At Gretel, we are committed to promoting open data and fostering collaboration within the AI community. We believe that by creating and sharing high-quality, diverse, and ethically sourced datasets, we can accelerate the development of more accurate, unbiased, and trustworthy AI systems.

Our release of the gretelai/synthetic_pii_finance_multilingual dataset on Hugging Face is just one example of our dedication to this cause. By providing a synthetic dataset that covers a wide range of financial document formats, PII types, and languages, we aim to empower developers and researchers to build more robust and customizable PII detection solutions.

Gretel is excited to expand our open synthetic dataset offerings, too, including updates to this dataset to cover more PII data labels, languages, and formats. We encourage the community to submit their suggestions and requests for additional synthetic datasets on our Discord channel. Your feedback will help us prioritize future releases and ensure that we are addressing the most pressing PII detection challenges faced by the community.