Accelerating FinTech Innovation with Natural Language to Code

Train financial LLMs with Gretel's Synthetic Text-to-Python dataset to transform natural language into precise, domain-specific Python code for FinTech.

FinTech thrives on data-driven decision-making, from fraud detection to regulatory compliance. Yet, implementing these solutions in Python often requires specialized skills, limiting accessibility and slowing innovation.

By leveraging large language models (LLMs), we can transform natural language into functional Python code, lowering technical barriers. Imagine asking, "Create a Python script to analyze transaction patterns for fraud detection", and instantly receiving executable code. This democratizes access to advanced financial analytics, enabling analysts, compliance officers, and other professionals to realize their ideas without extensive coding expertise.

For LLMs to deliver accurate, FinTech-specific code, they need tailored datasets. Existing datasets often miss domain-specific terminology and scenarios, limiting their utility. To address this, we created a synthetic Text-to-Python dataset for FinTech, enabling LLMs to generate precise, actionable code and accelerate industry innovation.

Generating High-Quality Synthetic Data with Gretel Navigator

Creating a high-quality synthetic Text-to-Python dataset requires a tailored and systematic approach. Using Gretel’s Navigator SDK and its new Data Designer mode, we streamline the process of designing, generating, and validating synthetic datasets.

Here’s how we create a FinTech-focused text-to-Python dataset:

1. Set Up the Workflow

The first step involves setting up the necessary libraries and configuring the synthetic data pipeline. This includes specifying the target programming language (Python) and selecting the appropriate suite of LLM models.

To get started, install the Gretel client:

pip install -Uqq gretel-client
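
With the client installed, import the Data Designer entry point from the Navigator SDK. As a minimal sketch, the import path below matches the early-preview SDK and may change as the preview evolves:

# Data Designer ships with the Navigator SDK; this import path reflects
# the early preview and may differ in later releases.
from gretel_client.navigator import DataDesigner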

2. Build the Dataset Configuration

Our Text-to-Python dataset configuration is carefully crafted for FinTech use cases. It uses a suite of Apache 2.0-licensed models to ensure open-source licensing compliance while delivering high-performance data generation.

Key Prompt Design Elements:

  • Special System Instructions: The LLM is framed as a Python expert specializing in FinTech, ensuring generated data reflects domain knowledge, regulatory compliance, and secure coding practices.
  • Categorical Seed Columns: These structure the dataset around industry_sector and topic, covering domains like "Fraud Detection" and "Algorithmic Trading." The code_complexity and code_concept columns further refine outputs from beginner- to expert-level tasks.
  • Generated Data Columns:
    • text: Prompts reflect realistic FinTech scenarios tailored to the specified domain, topic, and complexity.
    • code: Clean, executable Python code aligned with the generated prompts, ensuring adherence to the complexity and concepts specified.
  • Post-Processors: Validate Python code syntax (code_is_valid) and evaluate its alignment with prompts (text_to_python evaluator), ensuring high-quality and relevant outputs.

Below is the configuration used:

config = """

model_suite: apache-2.0

special_system_instructions: >-
  You are a highly skilled Python expert specializing in the FinTech domain. 
  Your knowledge spans areas like secure financial data handling, compliance with regulatory standards, 
  optimizing for large transaction volumes, and integrating financial services. You excel at converting 
  complex financial tasks from natural language into efficient Python code and vice versa. 
  Your goal is to assist the user with their Python-related FinTech tasks, producing high-quality, 
  secure, and maintainable Python code suitable for financial applications.

categorical_seed_columns:
  - name: industry_sector
    values:
      - Banking and Finance
      - FinTech and Digital Payments
      - Financial Regulation and Compliance
      - Fraud Detection and Prevention
      - Insurance and Risk Management
      - Mobile Banking
      - Mortgage and Lending Platforms
      - Smart Contracts
      - Tax Technology
      - Trading and Investment
    subcategories:
      - name: topic
        num_new_values_to_generate: 20

  - name: code_complexity
    values: [Beginner, Intermediate, Advanced, Expert]
    subcategories:
      - name: code_concept
        values:
          Beginner: [Variables, Data Types, Functions, Loops, Classes]
          Intermediate: [List Comprehensions, Object-oriented programming, Lambda Functions, Web frameworks, Pandas]
          Advanced: [Multithreading, Context Managers, Performance, Modules and Packages, Regular Expressions]
          Expert: [Custom Data Structures, Metaclasses, Coroutines, Memory Management]

generated_data_columns:
    - name: text
      generation_prompt: >-
        Generate an instruction to generate Python code to solve a specific problem. Instructions start with phrases such as "Write code to...", "Create a class that...", "Help me write...", etc \n

        Important Guidelines:
            * Make sure the instruction is related to the {industry_sector} sector, preferably by including relevant data models, systems, or sample data. If relevant, the instruction should be related to {topic}.
            * The instruction should be for code at a(n) {code_complexity} level, making use of {code_concept} when relevant.
            * Do not write any code as part of the question.
            * Do not write "Response:" in your response.
      columns_to_list_in_prompt: all_categorical_seed_columns

    - name: code
      llm_type: code
      generation_prompt: >-
        Write Python code that is the response to the following instruction prompt:
        {text} \n

        Important Guidelines:
            * Your code should be clean, complete, self-contained, and executable.
            * Remember to import any necessary libraries, and to use all libraries you import.
            * The code should be written at a {code_complexity} level, for example making use of concepts such as {code_concept}.
      columns_to_list_in_prompt: [industry_sector, topic]

post_processors:
    - validator: code
      settings:
        code_lang: python
        code_columns: [code]

    - evaluator: text_to_python
      settings:
"""
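
Since the configuration is authored as a YAML string inside Python, a quick parse check catches indentation or syntax mistakes before any workflow is submitted. A minimal sketch using PyYAML:

import yaml

# safe_load raises a YAMLError if the config has indentation or syntax
# problems, so failures surface before any compute is spent.
parsed = yaml.safe_load(config)
print(sorted(parsed.keys()))
# Expect: categorical_seed_columns, generated_data_columns, model_suite,
# post_processors, special_system_instructions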

3. Run the Pipeline

With the configuration complete, we generate 5,000 records using Gretel’s synthetic data pipeline. Each record pairs a natural language prompt with a corresponding Python code snippet tailored to FinTech.

Below is the code used. To execute the pipeline, you will need a Gretel API key, available from the Gretel Console.


session_kwargs = {
    "api_key": "prompt",
    "endpoint": "https://api.gretel.cloud",
    "cache": "yes",
}

# Initialize a Gretel Navigator Data Designer instance from the pre-defined configuration. 
designer = DataDesigner.from_config(config, **session_kwargs)

# Submit a batch workflow to generate synthetic data.
dataset_job = designer.submit_batch_workflow(num_records=5000)

# Wait for the workflow to complete and fetch the generated dataset.
generated_data = dataset_job.fetch_dataset(wait_for_completion=True)
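
Before moving on, it is worth eyeballing a few records. A minimal sketch, assuming fetch_dataset returns a pandas DataFrame as in the early-preview SDK:

# Spot-check the generated records; column names come from the config:
# industry_sector, topic, code_complexity, code_concept, text, code.
print(generated_data.shape)
print(generated_data[["industry_sector", "code_complexity", "text"]].head())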

4. Validate Your Data Quality

Ensuring data integrity is critical. Our validation process rigorously evaluates the generated dataset for correctness, diversity, and FinTech relevance.

Key Validation Steps:

  • Code Validation: Post-processors ensure Python syntax validity (code_is_valid) and adherence to best practices (pylint_score); a local spot-check sketch follows this list.
  • LLM-Based Evaluation: An LLM-as-a-judge assesses outputs for:
    • Relevance: Alignment with prompts.
    • Readability: Clarity and organization.
    • Efficiency: Scalability and performance.
    • Pythonic Standards: Compliance with Pythonic coding principles.
  • Dataset Metrics: Metrics like uniqueness (100%), semantic uniqueness (96.7%), and diversity indices ensure variety and quality.
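
The pipeline runs these checks for you, but the core syntax and uniqueness checks are easy to reproduce locally. A minimal sketch using only the standard library, assuming generated_data is the DataFrame fetched in step 3 with the text and code columns defined in the config:

import ast

def is_valid_python(code: str) -> bool:
    """Return True if the snippet parses as Python source."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

# Re-run the core checks on the generated records.
valid_share = generated_data["code"].map(is_valid_python).mean()
unique_share = generated_data["text"].nunique() / len(generated_data)
print(f"Syntactically valid code: {valid_share:.1%}")
print(f"Unique prompts: {unique_share:.1%}")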

Comprehensive Data Quality Report:

Gretel generates a detailed report with insights into data completeness, column cardinality, text diversity, and Python code quality.

To download the report:

path = dataset_job.download_evaluation_report()

Gretel’s Synthetic Text-to-Python Dataset for FinTech

The Gretel Synthetic Text-to-Python Dataset for FinTech is a curated collection of 25,000 records crafted to support a wide range of FinTech coding applications. The dataset was generated by submitting multiple workflows through Gretel Navigator, our agent-based, compound AI system, and only records with valid Python code and a pylint score of 10, as verified by the validation post-processors, were retained.

Each record includes the following key fields:

  • industry_sector: Specific FinTech domain (e.g., Banking and Finance, Smart Contracts).
  • topic: Specific topic within the domain (e.g., Loan Forbearance, Mobile Banking).
  • code_complexity: Code complexity level (e.g., Beginner, Intermediate, Advanced, Expert).
  • code_concept: Programming concept demonstrated (e.g., Variables, Loops, Functions).
  • text: Natural language instruction describing the desired functionality.
  • code: Python code generated based on the instruction.

The dataset is publicly available on Hugging Face under gretelai/gretel-text-to-python-fintech-en-v1. For a detailed breakdown of domains, topics, and validation metrics, refer to the dataset card.
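
The dataset can be pulled directly with the Hugging Face datasets library; a short example, assuming a standard train split:

from datasets import load_dataset

# Load the public dataset from the Hugging Face Hub.
ds = load_dataset("gretelai/gretel-text-to-python-fintech-en-v1", split="train")
print(ds[0]["text"])
print(ds[0]["code"])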

Conclusion

This dataset is only the start. With Gretel Navigator's new Data Designer mode, you can create synthetic datasets tailored to your unique business requirements, incorporating custom logic and meaningful context. We invite the community to harness these tools to accelerate innovation and unlock new possibilities in AI-driven solutions.

Data Designer is currently in early preview. If you would like to build datasets like this or try out other use cases, sign up here to participate!

If you have questions or comments about Data Designer or the Gretel Platform, join us and 1,600 other developers, engineers, data scientists, and privacy fans in the Synthetic Data Community on Discord.