---
title: Synthetic Data Generator
short_description: Build datasets using natural language
emoji: 🧬
colorFrom: yellow
colorTo: pink
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: true
license: apache-2.0
hf_oauth: true
#header: mini
hf_oauth_scopes:
- read-repos
- write-repos
- manage-repos
- inference-api
---
<br>
<h2 align="center">
<a href=""><img src="https://raw.githubusercontent.com/argilla-io/synthetic-data-generator/main/assets/logo.svg" alt="Synthetic Data Generator Logo" width="80%"></a>
</h2>
<h3 align="center">Build datasets using natural language</h3>
![Synthetic Data Generator](https://huggingface.co/spaces/argilla/synthetic-data-generator/resolve/main/assets/ui-full.png)
## Introduction
Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs. [The announcement blog](https://huggingface.co/blog/synthetic-data-generator) goes over a practical example of how to use it.
Supported Tasks:
- Text Classification
- Chat Data for Supervised Fine-Tuning
This tool simplifies the process of creating custom datasets, enabling you to:
- Describe the characteristics of your desired application
- Iterate on sample datasets
- Produce full-scale datasets
- Push your datasets to the [Hugging Face Hub](https://huggingface.co/datasets?other=datacraft) and/or [Argilla](https://docs.argilla.io/)
By using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.
<p align="center">
<a href="https://twitter.com/argilla_io">
<img src="https://img.shields.io/badge/twitter-black?logo=x"/>
</a>
<a href="https://www.linkedin.com/company/argilla-io">
<img src="https://img.shields.io/badge/linkedin-blue?logo=linkedin"/>
</a>
<a href="http://hf.co/join/discord">
<img src="https://img.shields.io/badge/Discord-7289DA?&logo=discord&logoColor=white"/>
</a>
</p>
## Installation
You can simply install the package with:
```bash
pip install synthetic-dataset-generator
```
### Quickstart
```python
from synthetic_dataset_generator.app import demo
demo.launch()
```
### Environment Variables
- `HF_TOKEN`: Your [Hugging Face token](https://huggingface.co/settings/tokens/new?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained) to push your datasets to the Hugging Face Hub and generate free completions from Hugging Face Inference Endpoints. You can find some configuration examples in the [examples](examples/) folder.
Optionally, you can set the following environment variables to customize the generation process.
- `MAX_NUM_TOKENS`: The maximum number of tokens to generate, defaults to `2048`.
- `MAX_NUM_ROWS`: The maximum number of rows to generate, defaults to `1000`.
- `DEFAULT_BATCH_SIZE`: The default batch size to use for generating the dataset, defaults to `5`.
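
As a rough illustration, the three variables above could be set from Python before the app is started. This is only a sketch: `apply_generation_settings` and `launch_demo` are hypothetical helpers, not part of the package, and the sketch assumes the package reads these variables when it is imported, which is why the import happens lazily inside `launch_demo`.

```python
import os

# Defaults as documented above. Environment variables are always strings.
DOCUMENTED_DEFAULTS = {
    "MAX_NUM_TOKENS": "2048",
    "MAX_NUM_ROWS": "1000",
    "DEFAULT_BATCH_SIZE": "5",
}


def apply_generation_settings(overrides=None, environ=None):
    """Hypothetical helper: write generation settings into `environ`.

    Explicit overrides win; otherwise a value that is already set is kept,
    and the documented default is used as a last resort.
    """
    env = os.environ if environ is None else environ
    overrides = overrides or {}
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown setting(s): {sorted(unknown)}")

    applied = {}
    for name, default in DOCUMENTED_DEFAULTS.items():
        if name in overrides:
            value = int(overrides[name])
            if value <= 0:
                raise ValueError(f"{name} must be a positive integer")
            env[name] = str(value)
        else:
            env.setdefault(name, default)
        applied[name] = env[name]
    return applied


def launch_demo():
    """Start the app after the settings are in place.

    Assumption: the package reads the variables at import time,
    so the import is deferred until this function is called.
    """
    from synthetic_dataset_generator.app import demo

    demo.launch()
```

Calling `apply_generation_settings({"MAX_NUM_ROWS": 200})` followed by `launch_demo()` would then start the app with a smaller row limit, provided the assumption about import-time reading holds.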
Optionally, you can use different models and APIs.
- `BASE_URL`: The base URL for any OpenAI compatible API, e.g. `https://api-inference.huggingface.co/v1/`, `https://api.openai.com/v1/`.
- `MODEL`: The model to use for generating the dataset, e.g. `meta-llama/Meta-Llama-3.1-8B-Instruct`, `gpt-4o`.
- `API_KEY`: The API key to use for the generation API, e.g. `hf_...`, `sk-...`. If not provided, it will default to the provided `HF_TOKEN` environment variable.
- `MAGPIE_PRE_QUERY_TEMPLATE`: Enforce setting the pre-query template for Magpie generation to either `llama3` or `qwen2`. Note that this is only used if the model is a Qwen or Llama model. If you want to use other model families for chat data generation, feel free to [implement your own pre-query template](https://github.com/argilla-io/distilabel/pull/778/files).
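
The `API_KEY` fallback and the allowed template values described above can be pictured with a small sketch. `resolve_api_settings` is a hypothetical helper written for illustration only; it is not the package's own code and simply mirrors the rules stated in this list.

```python
import os

# The two template names listed in this README.
ALLOWED_MAGPIE_TEMPLATES = ("llama3", "qwen2")


def resolve_api_settings(environ=None):
    """Hypothetical helper: collect the model/API settings from the environment."""
    env = os.environ if environ is None else environ

    template = env.get("MAGPIE_PRE_QUERY_TEMPLATE")
    if template is not None and template not in ALLOWED_MAGPIE_TEMPLATES:
        raise ValueError(
            f"MAGPIE_PRE_QUERY_TEMPLATE must be one of {ALLOWED_MAGPIE_TEMPLATES}, "
            f"got {template!r}"
        )

    return {
        "base_url": env.get("BASE_URL"),
        "model": env.get("MODEL"),
        # API_KEY wins when present; otherwise fall back to HF_TOKEN.
        "api_key": env.get("API_KEY") or env.get("HF_TOKEN"),
        "magpie_pre_query_template": template,
    }
```

Passing a plain dictionary instead of the real environment makes the behavior easy to check: with only `HF_TOKEN` set, the returned `api_key` is the Hugging Face token, and with both set, `API_KEY` takes precedence.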
Optionally, you can also push your datasets to Argilla for further curation by setting the following environment variables:
- `ARGILLA_API_KEY`: Your Argilla API key to push your datasets to Argilla.
- `ARGILLA_API_URL`: Your Argilla API URL to push your datasets to Argilla.
### Argilla integration
Argilla is an open-source tool for data curation. It allows you to annotate and review datasets, and push curated datasets to the Hugging Face Hub. You can easily get started with Argilla by following the [quickstart guide](https://docs.argilla.io/latest/getting_started/quickstart/).
![Argilla integration](https://huggingface.co/spaces/argilla/synthetic-data-generator/resolve/main/assets/argilla.png)
## Custom synthetic data generation?
Each pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.
Check out the [distilabel library](https://github.com/argilla-io/distilabel) for more information.
## Development
Install the dependencies:
```bash
# Create a virtual environment
python -m venv .venv
source .venv/bin/activate
# Install the dependencies
pip install -e .  # alternatively: pdm install
```
Run the app:
```bash
python app.py
```