An Awesome Synthetic Multilingual Prompts Dataset

Gretel's latest open synthetic dataset aims to enhance LLM interactions and contributes to the popular 'awesome-chatGPT-prompts' GitHub repository.

Gretel is excited to announce the release of our "Synthetic Multilingual LLM Prompts" dataset! This collection features 1,250 synthetic LLM prompts generated using Gretel Navigator, our compound AI system for designing custom datasets for better AI training. Inspired by the popular Awesome ChatGPT Prompts dataset (107k GitHub stars), our dataset is available in seven languages, all synthetically generated. In addition to using Gretel Navigator for data generation, we used the same AI system in an LLM-as-a-judge approach to check the accuracy and diversity of the prompts, as well as translation quality and consistency across the languages.

This dataset is designed to help LLMs generate diverse, multilingual responses based on the provided prompts. We are excited to contribute this dataset directly to the Awesome ChatGPT Prompts GitHub repository and its corresponding dataset on Hugging Face.

Disclaimer: This dataset, including its translations, was generated synthetically and has not been refined through human review. As a result, inaccuracies may be present. We invite the community to explore and contribute to this work to enhance the versatility and richness of LLM interactions.

Dataset Overview

This dataset provides a rich collection of prompts that can be used with conversational LLM interfaces such as ChatGPT. Each prompt is available in the following languages:

  • English (en) 🇺🇸
  • Dutch (nl_NL) 🇳🇱
  • French (fr_FR) 🇫🇷
  • Spanish (es_ES) 🇪🇸
  • German (de_DE) 🇩🇪
  • Portuguese (Brazilian) (pt_BR) 🇧🇷
  • Chinese (Simplified) (zh_CN) 🇨🇳

Dataset Schema

The main dataset consists of the following fields:

  • id: Prompt id.
  • act: The role or scenario for which the prompt is designed.
  • prompt: The prompt text in English.
  • prompt_nl_NL: The prompt text in Dutch.
  • prompt_fr_FR: The prompt text in French.
  • prompt_es_ES: The prompt text in Spanish.
  • prompt_de_DE: The prompt text in German.
  • prompt_pt_BR: The prompt text in Brazilian Portuguese.
  • prompt_zh_CN: The prompt text in Simplified Chinese.
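
As a minimal sketch of how these fields fit together, the snippet below maps a locale code to its column name and reshapes one wide record into one row per language. The helper names (prompt_column, to_long_rows) and the sample record are hypothetical and are not part of the dataset or any library.

```python
# Locale codes from the language list above. English is stored in the bare
# "prompt" column; every other language uses a "prompt_<locale>" column.
LOCALES = ["en", "nl_NL", "fr_FR", "es_ES", "de_DE", "pt_BR", "zh_CN"]


def prompt_column(locale: str) -> str:
    """Return the name of the column that holds the prompt for a locale."""
    if locale not in LOCALES:
        raise ValueError(f"unsupported locale: {locale!r}")
    return "prompt" if locale == "en" else f"prompt_{locale}"


def to_long_rows(record: dict) -> list:
    """Reshape one wide record into one row per language (hypothetical helper)."""
    return [
        {
            "id": record["id"],
            "act": record["act"],
            "locale": locale,
            "text": record[prompt_column(locale)],
        }
        for locale in LOCALES
    ]


# Made-up record for illustration only; it is not taken from the dataset.
sample = {
    "id": 0,
    "act": "Travel Guide",
    "prompt": "Act as a travel guide.",
    "prompt_nl_NL": "Gedraag je als een reisgids.",
    "prompt_fr_FR": "Agis comme un guide de voyage.",
    "prompt_es_ES": "Actúa como un guía de viaje.",
    "prompt_de_DE": "Verhalte dich wie ein Reiseführer.",
    "prompt_pt_BR": "Aja como um guia de viagem.",
    "prompt_zh_CN": "请充当导游。",
}

rows = to_long_rows(sample)
for row in rows:
    print(row["locale"], row["text"])
```

A long layout like this can be convenient when evaluating one language at a time, while the wide layout keeps all translations of a prompt side by side.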

Translation Quality

The translation quality of each prompt from English into the target language was assessed using the LLM-as-a-Judge method, powered by Gretel Navigator. Each translation was scored from 1 to 10 based on three key criteria:

  • Accuracy: How accurately the meaning is preserved in the translation.
  • Fluency: How naturally the translated text reads in the target language.
  • Consistency: How consistently terms and phrases are translated.

An overall score was then assigned to each translation, representing an average of the accuracy, fluency, and consistency scores. Only prompts with translations that achieved an overall score of 7 or higher were retained in the main dataset.
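
The scoring rule described above can be sketched as follows. This is an illustration under stated assumptions, not Gretel's implementation: it assumes the overall score is an unweighted mean of the three criteria, and that a prompt is retained only when every one of its translations reaches the threshold. The function names and the example scores are made up.

```python
MIN_OVERALL = 7  # retention threshold stated in the text


def overall_score(accuracy: float, fluency: float, consistency: float) -> float:
    """Unweighted mean of the three criteria (assumed form of the average)."""
    for value in (accuracy, fluency, consistency):
        if not 1 <= value <= 10:
            raise ValueError("scores must lie between 1 and 10")
    return (accuracy + fluency + consistency) / 3


def is_retained(scores_by_locale: dict) -> bool:
    """Assumption: keep a prompt only if every translation meets the threshold."""
    return all(
        overall_score(*scores) >= MIN_OVERALL
        for scores in scores_by_locale.values()
    )


# Made-up judge scores as (accuracy, fluency, consistency) tuples.
example = {"fr_FR": (9, 8, 8), "de_DE": (7, 6, 7)}
print(is_retained(example))
```

In this example the German translation averages below 7, so the prompt would be dropped under this reading of the rule.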

The scores and detailed evaluations for each language can be found in the corresponding CSV files uploaded separately:

  • prompt_nl_NL.csv
  • prompt_fr_FR.csv
  • prompt_es_ES.csv
  • prompt_de_DE.csv
  • prompt_pt_BR.csv
  • prompt_zh_CN.csv

Usage

To use this dataset, you can load it from Hugging Face using the following code snippet:

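# Requires the Hugging Face `datasets` package (pip install datasets).
from datasets import load_dataset

# The second argument, "main", is the configuration name for the main dataset.
dataset = load_dataset("gretelai/synthetic_multilingual_llm_prompts", "main")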

License

This dataset is released under the Apache 2.0 license, making it open for public use with proper attribution.

Reference

If you use this dataset, please cite it as follows:

@software{gretel-synthetic-multilingual-llm-prompts-2024,
  author = {Van Segbroeck, Maarten and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Greco, Johnny and Boyd, Kendrick and Grossman, Matthew and Meyer, Yev},
  title = {{Synthetic Multilingual LLM Prompts}: A synthetic multilingual prompt dataset for prompting LLMs},
  month = {June},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic_multilingual_llm_prompts}
}

Supporting the Open Data Community 🤝

At Gretel, we are committed to promoting open data and fostering collaboration within the AI community. We believe that by creating and sharing high-quality, diverse, and ethically sourced datasets, we can accelerate the development of more accurate, unbiased, and trustworthy AI systems.

Gretel is excited to expand our open synthetic dataset offerings with this release. It joins two other recent releases: the world’s largest synthetic Text-to-SQL dataset, for improving interactions with databases, data warehouses, and data lakes, and a synthetic financial document dataset for training robust PII detection systems. We encourage you to submit suggestions and requests for additional synthetic datasets on our Discord channel. Your feedback will help us prioritize future releases and ensure that we are addressing the most pressing data quality challenges faced by the community.