File size: 5,422 Bytes
bc3b4e5
931b664
3445828
931b664
 
63e2a80
bc3b4e5
bf5627a
bc3b4e5
84c9d20
bc3b4e5
1b76a92
6d8f6c2
1b76a92
13da841
 
 
 
bc3b4e5
80d6fc7
c26b2b4
c074a35
 
 
326c2f1
2995161
0a0f99c
2995161
 
 
5a5027e
2995161
 
 
 
84677f5
d6f9651
2995161
d6f9651
2995161
 
 
714b133
d6f9651
2995161
d6f9651
326c2f1
 
 
 
 
 
 
 
 
 
 
 
2995161
d6f9651
2995161
d6f9651
2995161
 
 
d6f9651
cc48701
 
 
5532825
cc48701
5532825
cc48701
 
2995161
d6f9651
1134ee9
cd47483
 
 
7b7c1be
 
 
 
4106f96
7b7c1be
4106f96
 
84677f5
5532825
4106f96
 
 
2cf2cd7
 
 
2995161
 
d6f9651
2cf2cd7
 
46474c4
2cf2cd7
 
 
2995161
d6f9651
2995161
d6f9651
 
cc48701
 
 
 
 
 
5dde474
cc48701
 
5dde474
 
 
cc48701
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
title: Synthetic Data Generator
short_description: Build datasets using natural language
emoji: 🧬
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: true
license: apache-2.0
hf_oauth: true
#header: mini
hf_oauth_scopes:
- read-repos
- write-repos
- manage-repos
- inference-api
---
<br>

<h2 align="center">
  <a href=""><img src="https://raw.githubusercontent.com/argilla-io/synthetic-data-generator/main/assets/logo.svg" alt="Synthetic Data Generator Logo" width="80%"></a>
</h2>
<h3 align="center">Build datasets using natural language</h3>

![Synthetic Data Generator](https://huggingface.co/spaces/argilla/synthetic-data-generator/resolve/main/assets/ui-full.png)

## Introduction

Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. [The announcement blog](https://huggingface.co/blog/synthetic-data-generator) goes over a practical example of how to use it.

Supported Tasks:

- Text Classification
- Chat Data for Supervised Fine-Tuning

This tool simplifies the process of creating custom datasets, enabling you to:

- Describe the characteristics of your desired application
- Iterate on sample datasets
- Produce full-scale datasets
- Push your datasets to the [Hugging Face Hub](https://huggingface.co/datasets?other=datacraft) and/or [Argilla](https://docs.argilla.io/)

By using the Synthetic Data Generator, you can rapidly prototype and create datasets for, accelerating your AI development process.

<p align="center">
<a href="https://twitter.com/argilla_io">
<img src="https://img.shields.io/badge/twitter-black?logo=x"/>
</a>
<a href="https://www.linkedin.com/company/argilla-io">
<img src="https://img.shields.io/badge/linkedin-blue?logo=linkedin"/>
</a>
<a href="http://hf.co/join/discord">
<img src="https://img.shields.io/badge/Discord-7289DA?&logo=discord&logoColor=white"/>
</a>
</p>

## Installation

You can simply install the package with:

```bash
pip install synthetic-dataset-generator
```

### Quickstart

```python
from synthetic_dataset_generator import launch

launch()
```

### Environment Variables

- `HF_TOKEN`: Your [Hugging Face token](https://huggingface.co/settings/tokens/new?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained) to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints. You can find some configuration examples in the [examples](examples/) folder.

Optionally, you can set the following environment variables to customize the generation process.

- `MAX_NUM_TOKENS`: The maximum number of tokens to generate, defaults to `2048`.
- `MAX_NUM_ROWS`: The maximum number of rows to generate, defaults to `1000`.
- `DEFAULT_BATCH_SIZE`: The default batch size to use for generating the dataset, defaults to `5`.

Optionally, you can use different models and APIs. For providers outside of Hugging Face, we provide an integration through [LiteLLM](https://docs.litellm.ai/docs/providers).

- `BASE_URL`: The base URL for any OpenAI compatible API, e.g. `/static-proxy?url=https%3A%2F%2Fapi-inference.huggingface.co%2Fv1%2F%60%3C%2Fspan%3E%2C `https://api.openai.com/v1/`, `http://127.0.0.1:11434/v1/`.
- `MODEL`: The model to use for generating the dataset, e.g. `meta-llama/Meta-Llama-3.1-8B-Instruct`, `openai/gpt-4o`, `ollama/llama3.1`.
- `API_KEY`: The API key to use for the generation API, e.g. `hf_...`, `sk-...`. If not provided, it will default to the provided `HF_TOKEN` environment variable.

SFT and Chat Data generation is only supported with Hugging Face Inference Endpoints , and you can set the following environment variables use it with models other than Llama3 and Qwen2.

- `MAGPIE_PRE_QUERY_TEMPLATE`: Enforce setting the pre-query template for Magpie, which is only supported with Hugging Face Inference Endpoints. Llama3 and Qwen2 are supported out of the box and will use `"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"` and `"<|im_start|>user\n"` respectively. For other models, you can pass a custom pre-query template string.

Optionally, you can also push your datasets to Argilla for further curation by setting the following environment variables:

- `ARGILLA_API_KEY`: Your Argilla API key to push your datasets to Argilla.
- `ARGILLA_API_URL`: Your Argilla API URL to push your datasets to Argilla.

### Argilla integration

Argilla is an open source tool for data curation. It allows you to annotate and review datasets, and push curated datasets to the Hugging Face Hub. You can easily get started with Argilla by following the [quickstart guide](https://docs.argilla.io/latest/getting_started/quickstart/).

![Argilla integration](https://huggingface.co/spaces/argilla/synthetic-data-generator/resolve/main/assets/argilla.png)

## Custom synthetic data generation?

Each pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.

Check out the [distilabel library](https://github.com/argilla-io/distilabel) for more information.

## Development

Install the dependencies:

```bash
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install the dependencies
pip install -e . # pdm install
```

Run the app:

```bash
python app.py
```