Synthetic Data Generation with FastData and Hugging Face

Community Article Published January 7, 2025

Introduction

In the world of artificial intelligence (AI) and machine learning (ML), access to high-quality data is essential for building accurate and reliable models. However, acquiring large datasets, especially those involving sensitive or private information, can be challenging and often comes with ethical and legal considerations. Synthetic data generation offers a powerful solution by creating artificial datasets that mimic real-world data without using actual personal information.

In this blog post, we will explore the process of generating synthetic data using FastData, a Python library designed for this purpose, and how to integrate it with the Hugging Face Hub for dataset hosting and sharing.

Table of Contents

  1. What is Synthetic Data Generation and Why is it Important Today?
  2. What is Hugging Face Dataset Viewer?
  3. What is FastData?
  4. Setup Libraries
  5. Setup Artifacts
  6. Define Code
  7. Conclusion

1. What is Synthetic Data Generation and Why is it Important Today?

Synthetic data generation is the process of creating artificial data that mimics real-world data but does not use any actual sensitive or personal information. This data is generated using algorithms, simulations, or models. In the context of machine learning (ML) and deep learning, synthetic data is crucial because it:

  • Improves model performance: When real data is lacking (e.g., medical data, financial records, etc.), synthetic data helps train models without violating privacy or ethical constraints.
  • Reduces data scarcity: Many industries find collecting enough data to train complex models challenging. Synthetic data augments the data available for model training, providing a solution.
  • Ensures data privacy: Synthetic data doesn't contain real user information, which is important in domains like healthcare, finance, and insurance.

Synthetic data generation is becoming increasingly important in AI research. It allows researchers to overcome privacy issues and create robust models without needing access to real-world data.
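As a toy illustration of the "algorithms, simulations, or models" approach, the sketch below fabricates records that resemble real data but correspond to no real person. The fields and value ranges are entirely made up for demonstration:

```python
import random

random.seed(0)  # make the toy output reproducible

def synthetic_patient():
    """Fabricate one record that statistically resembles real data but maps to no real person."""
    return {
        "age": random.randint(1, 90),
        "heart_rate": round(random.gauss(72, 8)),  # roughly normal around a resting rate
        "smoker": random.random() < 0.2,           # ~20% prevalence, purely illustrative
    }

records = [synthetic_patient() for _ in range(5)]
print(records)
```

Real synthetic data pipelines (including the LLM-based one in this tutorial) are far more sophisticated, but the privacy property is the same: no record traces back to an actual individual.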


2. What is Hugging Face Dataset Viewer?

If you're working on data research or machine learning projects, you need a reliable way to share and host your datasets. Hugging Face Hub provides a seamless platform for hosting and sharing datasets with the world. By hosting a dataset on the Hub, you get instant access to the Dataset Viewer, a tool that offers many features to explore and interact with datasets.

Dataset Viewer for fka/awesome-chatgpt-prompts dataset.

Key Features of Hugging Face Dataset Viewer:

  • Interactive exploration: You can inspect the data, view individual records, and perform basic analyses without any coding or downloading.
  • Statistics: Get insightful statistics about the data at a glance.
  • Search and filtering: It provides search and filtering options to help you find the most relevant data for your use case.
  • Collaboration: You can easily share the dataset's URL with colleagues or the broader research community.

You can upload your dataset to the Hugging Face Hub by following these steps. However, in this tutorial, we'll use FastData, which conveniently integrates this functionality, streamlining your process.


3. What is FastData?

FastData is a minimal Python library designed to simplify the process of generating synthetic data, particularly for training deep learning models. It offers several useful features:

  • Customizable schema: You can easily define a custom schema for the synthetic data you want to generate.
  • Data generation with templates: FastData supports prompt-based generation, where you define templates for the data and the library generates consistent and relevant outputs.
  • Multithreading: It supports multithreading, allowing you to generate large amounts of data in parallel by specifying the maximum number of workers.
  • Push to Hugging Face: In version 0.0.4, FastData added functionality to push the generated data directly to Hugging Face using the generate_to_hf method, making data sharing and collaboration even easier. It also supports incremental loading to prevent data loss and optimize performance.

Key Parameters for Data Generation with FastData and Hugging Face:

  • prompt_template (str): Template for generating prompts.
  • inputs (list[dict]): List of input dictionaries to be processed.
  • schema: Defines the structure of the generated data.
  • repo_id (str): The Hugging Face dataset name.
  • temp (float, optional): Temperature for generation. Controls randomness in the output.
  • sp (str, optional): The system prompt for the assistant. Defaults to "You are a helpful assistant."
  • max_workers (int, optional): Maximum number of worker threads. Defaults to 64.
  • max_items_per_file (int, optional): Number of items to save in each file.
  • commit_every (Union[int, float], optional): Minutes between each commit to Hugging Face.
  • private (bool, optional): Set the repository to private. Defaults to None.
  • token (Optional[str], optional): Token to use for committing to the repo.
  • delete_files_after (bool, optional): Whether to delete files after processing. Defaults to True.

4. Setup Libraries

FastData currently supports Anthropic models. Therefore, you must have an API key for the model you want to use and configure the necessary environment variables. Follow these steps to get set up:

Install Dependencies

pip install python-fastdata datasets

Setup Anthropic API Key

  1. Create an account with Anthropic and obtain your API key.
  2. Set the API key in your environment variables:
export ANTHROPIC_API_KEY="your_api_key_here"

Setup Hugging Face Hub Key

  1. Create an account on Hugging Face and obtain your token.
  2. Set the Hugging Face token in your environment variables:
export HF_TOKEN="your_hugging_face_token"
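Before running anything, it can be useful to verify that both environment variables are actually set. Here is a minimal helper for that check (it is not part of FastData, just a convenience):

```python
import os

def check_env(keys):
    """Return the names of any environment variables that are missing or empty."""
    return [k for k in keys if not os.environ.get(k)]

missing = check_env(["ANTHROPIC_API_KEY", "HF_TOKEN"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```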

5. Setup Artifacts

To generate data, you first need to define the structure of the data using a schema. Next, you'll define the prompt used to generate the data, and finally, you’ll provide the inputs that will guide the data generation process.

Schema Definition

In this tutorial, we will generate a dataset of children’s stories, each consisting of a title, content, and a habit we aim to teach through the story. Let’s begin by defining the class:

from fastcore.utils import *
from fastdata.core import FastData
from datasets import load_dataset

class ChildrenStory:
    """
    Represents a children's story with a title, content, and the habit it promotes.
    """

    def __init__(self, title:str, content:str, habit:str):
        self.title = title
        self.content = content
        self.habit = habit
    
    def __repr__(self):
        return f"{self.title} ({self.habit}) ➡ *{self.content}*"

Define Prompt Template

prompt_template = """\
Generate Children's Stories with title, content and the corresponding habit on the following topic <topic>{text}</topic> 
"""

Define Inputs

We will use an existing dataset as input for our generations. Its 'text' column contains ideas/topics about good habits or routines we aim to teach through our stories, such as:

Setting a consistent bedtime routine, including reading a book, can improve children's sleep quality and overall health.

Let's generate 10 children's stories based on the topics in the infinite-dataset-hub/PositiveRoutine dataset:

inputs = load_dataset('infinite-dataset-hub/PositiveRoutine', split='train[:10]')
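If you want to experiment offline, `generate_to_hf` also accepts a plain list of dicts as `inputs`. The sketch below uses hypothetical stand-in topics mimicking the 'text' column, and repeats the prompt template so the snippet is self-contained; it also previews the prompt that would be built for the first row:

```python
prompt_template = """\
Generate Children's Stories with title, content and the corresponding habit on the following topic <topic>{text}</topic>
"""

# Hypothetical stand-in topics mirroring the 'text' column of PositiveRoutine
sample_inputs = [
    {"text": "Setting a consistent bedtime routine improves children's sleep quality."},
    {"text": "Brushing teeth twice a day keeps cavities away."},
]

# Preview the prompt that would be built for the first row
print(prompt_template.format(**sample_inputs[0]))
```

Passing `inputs=sample_inputs` instead of the loaded dataset would work the same way, since each row only needs a 'text' key matching the template placeholder.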

6. Define Code

Once the setup is complete, you can define the code to generate synthetic data and push it to Hugging Face Hub.

fast_data = FastData(model="claude-3-haiku-20240307") # Here you can change the Anthropic model
dataset_name = "children-stories-dataset" # Here you define the output dataset name

repo_id, stories = fast_data.generate_to_hf(
    prompt_template=prompt_template,
    inputs=inputs,
    schema=ChildrenStory,
    repo_id=dataset_name,
    max_items_per_file=4,
)

print(f"A new repository has been created at {repo_id}")
print(stories)

And that's it! Our dataset has been generated and pushed to the Hub:

If you expect your process to take longer and would like to make incremental updates to your dataset, you just need to set the following parameters in the generate_to_hf method:

  • max_items_per_file: How many records to store in each shard file pushed to the Hub.
  • commit_every: How often (in minutes) the scheduler pushes to the Hub, e.g., every 5 minutes.
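Put together, an incremental run might collect these settings as keyword arguments. The values below are illustrative assumptions, not recommendations; tune them for the size and duration of your own run:

```python
# Illustrative settings for incremental pushes (values are assumptions)
incremental_kwargs = dict(
    max_items_per_file=4,  # records per shard file pushed to the Hub
    commit_every=5,        # push to the Hub every 5 minutes
)

print(incremental_kwargs)
```

These would then be forwarded to the call shown above, e.g. `fast_data.generate_to_hf(..., **incremental_kwargs)`, so that partial results are committed periodically and a crash mid-run does not lose everything.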

7. Conclusion

Synthetic data generation plays a crucial role in AI and ML projects by helping researchers and developers generate useful datasets where real data might be scarce or difficult to obtain. With tools like FastData and the Hugging Face Hub, generating and sharing synthetic datasets has never been easier. You can now create large datasets, push them to the Hugging Face Hub, and explore them interactively using the Dataset Viewer. The integration of FastData simplifies the entire process, making it easier to work with large-scale synthetic data generation.

Happy data generation!