GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents

Gretel fine-tuned, synthetically-enhanced GLiNER models for better PII & PHI detection—datasets included.

Improving Personally Identifiable Information (PII) and Protected Health Information (PHI) entity detection requires high-quality datasets, but accessing real-world sensitive data is often restricted due to privacy concerns. At Gretel, we’ve addressed this challenge by developing synthetic documents enriched with a wide variety of PII and PHI entities, such as social security numbers, medical record numbers, and email addresses. These documents span multiple industries and document types, providing comprehensive coverage for training and fine-tuning models like GLiNER.

The synthetic data generated with Gretel Navigator offers a controlled and diverse environment, enabling realistic model training without the need to expose any real personal data. This makes it an ideal solution for tasks like Named Entity Recognition (NER) and sensitive data redaction, which typically require access to private information. By using synthetic data, developers can simulate real-world conditions, safely building models that handle sensitive information while ensuring privacy compliance.

Dataset Statistics and Distribution

The gretelai/gretel-pii-masking-en-v1 dataset was created with Gretel Navigator, using mistral-nemo-2407 as the backend model. It simulates real-world document excerpts filled with PII/PHI without exposing sensitive information from actual records, making it a secure foundation for developing and fine-tuning models in sensitive industries. It also includes formatting variants to ensure that fine-tuned GLiNER models remain robust across different document structures and layouts. By offering diverse scenarios, the dataset pushes the boundaries of PII and PHI detection while giving developers the confidence to fine-tune models and maintain privacy compliance.

The dataset is split into 50,000 training records, 5,000 validation records, and 5,000 test records, providing a well-balanced framework for model training and evaluation. It includes data from 45 different domains such as healthcare, finance, and cybersecurity, and covers a wide range of over 40 PII/PHI entity types, making it highly suitable for training models to recognize sensitive information across diverse real-world contexts.

Each document in the dataset is carefully constructed to include:

  • Unique Identifier (uid): Each document is assigned a unique identifier to ensure traceability and easy reference throughout the dataset.
  • Domain: Each document represents one of a large breadth of industries, such as healthcare, finance, and technology, ensuring broad applicability across various sectors.
  • Document Type: Each document is one of a variety of types, including contracts, medical records, and invoices, offering diverse scenarios for entity extraction tasks and model training.
  • Document Description: Each document is accompanied by a detailed description outlining its structure, typical fields, and intended purpose, providing crucial context for tasks like document classification and analysis.
  • Text Content: The core text of each document serves as a rich source of data for NLP tasks, such as text classification, Named Entity Recognition (NER), and sensitive information redaction.
  • PII/PHI Entity Annotations: Each document includes a wide range of PII and PHI entities, such as names, addresses, social security numbers (SSNs), and medical record numbers, all labeled by type. 
Source: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1

The dataset has been released under the permissive Apache-2.0 license. For more detailed statistics on the dataset, including domain and entity type distributions, check out the dataset documentation on Hugging Face.
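To make the record structure above concrete, the snippet below walks through a single record. Note that this record and its field names (`uid`, `domain`, `document_type`, `document_description`, `text`, `entities`) are illustrative assumptions based on the description above; consult the dataset card on Hugging Face for the exact column names and annotation format.

```python
# A hypothetical record mirroring the schema described above;
# field names are illustrative -- see the dataset card for exact columns.
record = {
    "uid": "a1b2c3d4",
    "domain": "healthcare",
    "document_type": "medical_record",
    "document_description": "A patient intake form with contact and insurance details.",
    "text": "Patient: Jane Doe\nSSN: 078-05-1120\nEmail: janedoe@company.com",
    "entities": [
        {"text": "Jane Doe", "label": "name"},
        {"text": "078-05-1120", "label": "ssn"},
        {"text": "janedoe@company.com", "label": "email"},
    ],
}

# Collect the annotated entity types present in this record
entity_types = sorted({e["label"] for e in record["entities"]})
print(entity_types)  # ['email', 'name', 'ssn']
```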

Fine-Tuning GLiNER Models: Enhancing PII and PHI Detection

To improve GLiNER performance on PII and PHI detection, we fine-tuned the knowledgator/gliner-bi-{small, base, large}-v1.0 models, releasing the results as gretelai/gretel-gliner-bi-{small, base, large}-v1.0.

Fine-tuning was performed on the training split of the dataset, with the validation split providing feedback during training so we could monitor the model's performance and adjust as needed. Final performance was evaluated on the test split. The PII and PHI entity annotations in the dataset served as ground-truth labels, guiding the model to learn accurate entity detection across diverse domains and document types.
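As a rough sketch of how such an evaluation can be scored, the function below computes span-level precision, recall, and F1 using exact (text, label) matching between predicted and ground-truth entities. This is a simplified stand-in for the actual evaluation pipeline, which is not shown here, and it ignores character offsets and partial overlaps.

```python
def entity_prf(predicted, gold):
    """Exact-match precision/recall/F1 over (text, label) entity pairs."""
    pred_set = {(e["text"], e["label"]) for e in predicted}
    gold_set = {(e["text"], e["label"]) for e in gold}
    tp = len(pred_set & gold_set)  # true positives: exact span+label matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one of two gold entities is found, with no false positives
gold = [
    {"text": "janedoe@company.com", "label": "email"},
    {"text": "078-05-1120", "label": "ssn"},
]
predicted = [{"text": "janedoe@company.com", "label": "email"}]
p, r, f = entity_prf(predicted, gold)
print(p, r, round(f, 3))  # 1.0 0.5 0.667
```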

All three fine-tuned model variants are available on Hugging Face.

The table below shows the performance of each fine-tuned model compared to its base model counterpart in terms of accuracy, precision, recall, and F1 score. Gretel's models outperformed the base models, achieving significantly higher metrics across the board.

Figure 1. Heat map showing Gretel Synthetics boosting performance across all base models.

The fine-tuning process greatly improved GLiNER models' ability to generalize across contexts and boosted their accuracy in identifying sensitive information, demonstrating the effectiveness of Gretel-generated synthetic data in optimizing model performance.

To use the Gretel GLiNER models, first install or update the gliner package:

pip install gliner -U

Below is an example using the gretelai/gretel-gliner-bi-small-v1.0 model for PII/PHI detection:

from gliner import GLiNER

# Load the fine-tuned GLiNER model
model = GLiNER.from_pretrained("gretelai/gretel-gliner-bi-small-v1.0")

# Sample text containing PII/PHI entities
text = """
Purchase Order
----------------
Date: 10/05/2023
----------------
Customer Name: CID-982305
Billing Address: 1234 Oak Street, Suite 400, Springfield, IL, 62704
Phone: (312) 555-7890 (555-876-5432)
Email: janedoe@company.com
"""

# Define the labels of expected PII/PHI entities
labels = [
    "medical_record_number",
    "date_of_birth",
    "ssn",
    "date",
    "first_name",
    "email",
    "last_name",
    "customer_id",
    "employee_id",
    "name",
    "street_address",
    "phone_number",
    "ipv4",
    "credit_card_number",
    "license_plate",
    "address",
    "user_name",
    "device_identifier",
    "bank_routing_number",
    "date_time",
    "company_name",
    "unique_identifier",
    "biometric_identifier",
    "account_number",
    "city",
    "certificate_license_number",
    "time",
    "postcode",
    "vehicle_identifier",
    "coordinate",
    "country",
    "api_key",
    "ipv6",
    "password",
    "health_plan_beneficiary_number",
    "national_id",
    "tax_id",
    "url",
    "state",
    "swift_bic",
    "cvv",
    "pin"
]

# Predict entities with a confidence threshold of 0.7
entities = model.predict_entities(text, labels, threshold=0.7)

# Display the detected entities
for entity in entities:
    print(f"{entity['text']} => {entity['label']}")

Expected output:

CID-982305 => customer_id
1234 Oak Street, Suite 400 => street_address
Springfield => city
IL => state
62704 => postcode
(312) 555-7890 => phone_number
555-876-5432 => phone_number
janedoe@company.com => email
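A common next step after detection is redaction: replacing each detected span with its label. Below is a minimal sketch, assuming `entities` is a list of dicts with `text` and `label` keys in the shape returned by `predict_entities` above; a production redactor would work from character offsets rather than string replacement.

```python
def redact(text, entities):
    """Replace each detected entity span with a bracketed label placeholder."""
    # Redact longer spans first so shorter substrings don't clobber them
    for entity in sorted(entities, key=lambda e: len(e["text"]), reverse=True):
        text = text.replace(entity["text"], f"[{entity['label'].upper()}]")
    return text

entities = [
    {"text": "janedoe@company.com", "label": "email"},
    {"text": "(312) 555-7890", "label": "phone_number"},
]
redacted = redact("Phone: (312) 555-7890\nEmail: janedoe@company.com", entities)
print(redacted)
# Phone: [PHONE_NUMBER]
# Email: [EMAIL]
```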

Conclusion

The fine-tuned GLiNER models, trained on Gretel-generated synthetic data, are designed for highly accurate PII and PHI detection across various industries while ensuring privacy compliance. Their ability to generalize across diverse domains makes them ideal for applications in healthcare, finance, and more. 

These models will soon be integrated into Gretel Transform V2, offering users even more robust options for handling sensitive data.