synthetic-data-generator

davidberenstein1957 HF staff committed on Dec 3, 2024

Commit 2995161, 1 Parent(s): f007fb2

fix oauth behavior locally

Files changed (10)

README.md +60 -27
app.py +3 -7
assets/ui.png +0 -0
demo.py +0 -61
pdm.lock +127 -2
pyproject.toml +1 -1
src/distilabel_dataset_generator/__init__.py +26 -0
src/distilabel_dataset_generator/apps/base.py +3 -2
src/distilabel_dataset_generator/pipelines/base.py +1 -1
src/distilabel_dataset_generator/utils.py +8 -19

README.md CHANGED

@@ -18,47 +18,80 @@ hf_oauth_scopes:
 - inference-api
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-<div class="header-container">
-    <div class="title-container">
-        <h1 style="margin: 0; font-size: 2em;">🧬 Synthetic Data Generator</h1>
-        <p style="margin: 10px 0 0 0; color: #666; font-size: 1.1em;">Build datasets using natural language</p>
-    </div>
-</div>
-<br>
-This repository contains the code for the [free Synthetic Data Generator app](https://huggingface.co/spaces/argilla/synthetic-data-generator), which is hosted on the Hugging Face Hub.
-## How it works?
-![Synthetic Data Generator](https://huggingface.co/spaces/argilla/synthetic-data-generator/resolve/main/assets/flow.png)
-Distilabel Synthetic Data Generator is a tool that allows you to easily create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and advanced language models to generate synthetic data tailored to your specific needs.
-This tool simplifies the process of creating custom datasets, enabling you to:
-- Define the characteristics of your desired application
-- Generate system prompts and tasks automatically
-- Create sample datasets for quick iteration
-- Produce full-scale datasets with customizable parameters
-- Push your generated datasets directly to the Hugging Face Hub
-By using Distilabel Synthetic Data Generator, you can rapidly prototype and create datasets for, accelerating your AI development process.
-## Do you want to run this locally?
-You can simply clone the repository and run it locally with:
 ```bash
-pip install -r requirements.txt
 python app.py
 ```
-Note that you do need to have an `HF_TOKEN` that can make calls to the free serverless Hugging Face Inference Endpoints. You can get one [here](https://huggingface.co/settings/tokens/new?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained).
-## Do you need more control?
-Each pipeline is based on a distilabel component, so you can easily run it locally or with other LLMs.
 Check out the [distilabel library](https://github.com/argilla-io/distilabel) for more information.

 - inference-api
 ---
+<h1 align="center">
+  <br>
+  🧬 Synthetic Data Generator
+  <br>
+</h1>
+<h3 align="center">Build datasets using natural language</h3>
+![Synthetic Data Generator](https://huggingface.co/spaces/argilla/synthetic-data-generator/resolve/main/assets/ui.png)
+<p align="center">
+<a  href="https://pypi.org/project/synthetic-dataset-generator/">
+<img alt="CI" src="https://img.shields.io/pypi/v/synthetic-dataset-generator.svg?style=flat-round&logo=pypi&logoColor=white">
+</a>
+<a href="https://pepy.tech/project/synthetic-dataset-generator">
+<img alt="CI" src="https://static.pepy.tech/personalized-badge/argilla?period=month&units=international_system&left_color=grey&right_color=blue&left_text=pypi%20downloads/month">
+</a>
+<a href="https://huggingface.co/spaces/argilla/synthetic-data-generator?duplicate=true">
+<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/duplicate-this-space-sm.svg"/>
+</a>
+</p>
+<p align="center">
+<a href="https://twitter.com/argilla_io">
+<img src="https://img.shields.io/badge/twitter-black?logo=x"/>
+</a>
+<a href="https://www.linkedin.com/company/argilla-io">
+<img src="https://img.shields.io/badge/linkedin-blue?logo=linkedin"/>
+</a>
+<a href="http://hf.co/join/discord">
+<img src="https://img.shields.io/badge/Discord-7289DA?&logo=discord&logoColor=white"/>
+</a>
+</p>
+## Introduction
+Synthetic Data Generator is a tool that allows you to create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and LLMs to generate synthetic data tailored to your specific needs.
+Supported Tasks:
+- Text Classification
+- Supervised Fine-Tuning
+- Judging and rationale evaluation
+This tool simplifies the process of creating custom datasets, enabling you to:
+- Describe the characteristics of your desired application
+- Iterate on sample datasets
+- Produce full-scale datasets
+- Push your datasets to the [Hugging Face Hub](https://huggingface.co/datasets?other=datacraft) and/or Argilla
+By using the Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.
+## Installation
+You can simply install the package with:
+```bash
+pip install synthetic-dataset-generator
+```
+### Environment Variables
+- `HF_TOKEN`: Your Hugging Face token to push your datasets to the Hugging Face Hub and run Inference Endpoints Requests. You can get one [here](https://huggingface.co/settings/tokens/new?ownUserPermissions=repo.content.read&ownUserPermissions=repo.write&globalPermissions=inference.serverless.write&tokenType=fineGrained).
+- `ARGILLA_API_KEY`: Your Argilla API key to push your datasets to Argilla.
+- `ARGILLA_API_URL`: Your Argilla API URL to push your datasets to Argilla.
+## Quick Start
 ```bash
 python app.py
 ```
+## Custom synthetic data generation?
+Each pipeline is based on distilabel, so you can easily change the LLM or the pipeline steps.
 Check out the [distilabel library](https://github.com/argilla-io/distilabel) for more information.

app.py CHANGED

@@ -1,12 +1,10 @@
-import gradio as gr
 from src.distilabel_dataset_generator._tabbedinterface import TabbedInterface
 from src.distilabel_dataset_generator.apps.faq import app as faq_app
 from src.distilabel_dataset_generator.apps.sft import app as sft_app
-from src.distilabel_dataset_generator.apps.eval import app as eval_app
 from src.distilabel_dataset_generator.apps.textcat import app as textcat_app
-theme ='argilla/argilla-theme'
 css = """
 button[role="tab"][aria-selected="true"] { border: 0; background: var(--neutral-800); color: white; border-top-right-radius: var(--radius-md); border-top-left-radius: var(--radius-md)}
@@ -29,9 +27,7 @@ demo = TabbedInterface(
     [textcat_app, sft_app, eval_app, faq_app],
     ["Text Classification", "Supervised Fine-Tuning", "Evaluation", "FAQ"],
     css=css,
-    title="""
-    <h1>Synthetic Data Generator</h1>
-    """,
     head="Synthetic Data Generator",
     theme=theme,
 )

 from src.distilabel_dataset_generator._tabbedinterface import TabbedInterface
+from src.distilabel_dataset_generator.apps.eval import app as eval_app
 from src.distilabel_dataset_generator.apps.faq import app as faq_app
 from src.distilabel_dataset_generator.apps.sft import app as sft_app
 from src.distilabel_dataset_generator.apps.textcat import app as textcat_app
+theme = "argilla/argilla-theme"
 css = """
 button[role="tab"][aria-selected="true"] { border: 0; background: var(--neutral-800); color: white; border-top-right-radius: var(--radius-md); border-top-left-radius: var(--radius-md)}
     [textcat_app, sft_app, eval_app, faq_app],
     ["Text Classification", "Supervised Fine-Tuning", "Evaluation", "FAQ"],
     css=css,
+    title="Synthetic Data Generator",
     head="Synthetic Data Generator",
     theme=theme,
 )

assets/ui.png ADDED

demo.py DELETED

@@ -1,61 +0,0 @@
-import gradio as gr
-from src.distilabel_dataset_generator._tabbedinterface import TabbedInterface
-from src.distilabel_dataset_generator.apps.eval import app as eval_app
-from src.distilabel_dataset_generator.apps.faq import app as faq_app
-from src.distilabel_dataset_generator.apps.sft import app as sft_app
-from src.distilabel_dataset_generator.apps.textcat import app as textcat_app
-theme = gr.themes.Monochrome(
-    spacing_size="md",
-    font=[gr.themes.GoogleFont("Inter"), "ui-sans-serif", "system-ui", "sans-serif"],
-)
-css = """
-.main_ui_logged_out{opacity: 0.3; pointer-events: none}
-.tabitem{border: 0px}
-.group_padding{padding: .55em}
-#space_model .wrap > label:last-child{opacity: 0.3; pointer-events:none}
-#system_prompt_examples {
-    color: black;
-}
-@media (prefers-color-scheme: dark) {
-    #system_prompt_examples {
-        color: white;
-        background-color: black;
-    }
-}
-button[role="tab"].selected,
-button[role="tab"][aria-selected="true"],
-button[role="tab"][data-tab-id][aria-selected="true"] {
-    background-color: #000000;
-    color: white;
-    border: none;
-    font-size: 16px;
-    font-weight: bold;
-    box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
-    transition: background-color 0.3s ease, color 0.3s ease;
-}
-.gallery {
-    color: black !important;
-}
-.flex-shrink-0.truncate.px-1 {
-    color: black !important;
-}
-"""
-demo = TabbedInterface(
-    [textcat_app, sft_app, eval_app, faq_app],
-    ["Text Classification", "Supervised Fine-Tuning", "Evaluation", "FAQ"],
-    css=css,
-    title="""
-    <h1>Synthetic Data Generator</h1>
-    <h3>Build datasets using natural language</h3>
-    """,
-    head="Synthetic Data Generator",
-    theme=theme,
-)
-if __name__ == "__main__":
-    demo.launch()

pdm.lock CHANGED

@@ -5,7 +5,7 @@
 groups = ["default"]
 strategy = ["inherit_metadata"]
 lock_version = "4.5.0"
-content_hash = "sha256:95ec72fc76abcd69ff04b12eff97512756f853fa88f9489b383d7a97d95193f9"
 [[metadata.targets]]
 requires_python = ">=3.10,<3.13"
@@ -564,7 +564,7 @@ files = [
 [[package]]
 name = "distilabel"
 version = "1.4.1"
-extras = ["argilla", "hf-inference-endpoints", "outlines"]
 requires_python = ">=3.9"
 summary = "Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs."
 groups = ["default"]
@@ -572,6 +572,7 @@ dependencies = [
     "argilla>=2.0.0",
     "distilabel==1.4.1",
     "huggingface-hub>=0.22.0",
     "ipython",
     "numba>=0.54.0",
     "outlines>=0.0.40",
@@ -581,6 +582,28 @@ files = [
     {file = "distilabel-1.4.1.tar.gz", hash = "sha256:0c373be234e8f2982ec7f940d9a95585b15306b6ab5315f5a6a45214d8f34006"},
 ]
 [[package]]
 name = "exceptiongroup"
 version = "1.2.2"
@@ -942,6 +965,30 @@ files = [
     {file = "importlib_resources-6.4.5.tar.gz", hash = "sha256:980862a1d16c9e147a59603677fa2aa5fd82b87f223b6cb870695bcfce830065"},
 ]
 [[package]]
 name = "interegular"
 version = "0.3.3"
@@ -1015,6 +1062,52 @@ files = [
     {file = "jinja2-3.1.4.tar.gz", hash = "sha256:4a3aee7acbbe7303aede8e9648d13b8bf88a429282aa6122a993f0ac800cb369"},
 ]
 [[package]]
 name = "joblib"
 version = "1.4.2"
@@ -1638,6 +1731,27 @@ files = [
     {file = "nvidia_nvtx_cu12-12.4.127-py3-none-win_amd64.whl", hash = "sha256:641dccaaa1139f3ffb0d3164b4b84f9d253397e38246a4f2f36728b48566d485"},
 ]
 [[package]]
 name = "orjson"
 version = "3.10.11"
@@ -2680,6 +2794,17 @@ files = [
     {file = "tblib-3.0.0.tar.gz", hash = "sha256:93622790a0a29e04f0346458face1e144dc4d32f493714c6c3dff82a4adb77e6"},
 ]
 [[package]]
 name = "threadpoolctl"
 version = "3.5.0"

 groups = ["default"]
 strategy = ["inherit_metadata"]
 lock_version = "4.5.0"
+content_hash = "sha256:87e2a6c0c74be28ed570492c4401d430ae5ce4dfad5f015cd3e6b476f9c14f2f"
 [[metadata.targets]]
 requires_python = ">=3.10,<3.13"
 [[package]]
 name = "distilabel"
 version = "1.4.1"
+extras = ["argilla", "hf-inference-endpoints", "instructor", "outlines"]
 requires_python = ">=3.9"
 summary = "Distilabel is an AI Feedback (AIF) framework for building datasets with and for LLMs."
 groups = ["default"]
     "argilla>=2.0.0",
     "distilabel==1.4.1",
     "huggingface-hub>=0.22.0",
+    "instructor>=1.2.3",
     "ipython",
     "numba>=0.54.0",
     "outlines>=0.0.40",
     {file = "distilabel-1.4.1.tar.gz", hash = "sha256:0c373be234e8f2982ec7f940d9a95585b15306b6ab5315f5a6a45214d8f34006"},
 ]
+[[package]]
+name = "distro"
+version = "1.9.0"
+requires_python = ">=3.6"
+summary = "Distro - an OS platform information API"
+groups = ["default"]
+files = [
+    {file = "distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2"},
+    {file = "distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed"},
+]
+[[package]]
+name = "docstring-parser"
+version = "0.16"
+requires_python = ">=3.6,<4.0"
+summary = "Parse Python docstrings in reST, Google and Numpydoc format"
+groups = ["default"]
+files = [
+    {file = "docstring_parser-0.16-py3-none-any.whl", hash = "sha256:bf0a1387354d3691d102edef7ec124f219ef639982d096e26e3b60aeffa90637"},
+    {file = "docstring_parser-0.16.tar.gz", hash = "sha256:538beabd0af1e2db0146b6bd3caa526c35a34d61af9fd2887f3a8a27a739aa6e"},
+]
 [[package]]
 name = "exceptiongroup"
 version = "1.2.2"
     {file = "importlib_resources-6.4.5.tar.gz", hash = "sha256:980862a1d16c9e147a59603677fa2aa5fd82b87f223b6cb870695bcfce830065"},
 ]
+[[package]]
+name = "instructor"
+version = "1.7.0"
+requires_python = "<4.0,>=3.9"
+summary = "structured outputs for llm"
+groups = ["default"]
+dependencies = [
+    "aiohttp<4.0.0,>=3.9.1",
+    "docstring-parser<0.17,>=0.16",
+    "jinja2<4.0.0,>=3.1.4",
+    "jiter<0.7,>=0.6.1",
+    "openai<2.0.0,>=1.52.0",
+    "pydantic-core<3.0.0,>=2.18.0",
+    "pydantic<3.0.0,>=2.8.0",
+    "requests<3.0.0,>=2.32.3",
+    "rich<14.0.0,>=13.7.0",
+    "tenacity<10.0.0,>=9.0.0",
+    "typer<1.0.0,>=0.9.0",
+]
+files = [
+    {file = "instructor-1.7.0-py3-none-any.whl", hash = "sha256:0bff965d71a5398aed9d3f728e07ffb7b5050569c81f306c0e5a8d022071fe29"},
+    {file = "instructor-1.7.0.tar.gz", hash = "sha256:51b308ae9c5e4d56096514be785ac4f28f710c91bed80af74412fc21593431b3"},
+]
 [[package]]
 name = "interegular"
 version = "0.3.3"
     {file = "jinja2-3.1.4.tar.gz", hash = "sha256:4a3aee7acbbe7303aede8e9648d13b8bf88a429282aa6122a993f0ac800cb369"},
 ]
+[[package]]
+name = "jiter"
+version = "0.6.1"
+requires_python = ">=3.8"
+summary = "Fast iterable JSON parser."
+groups = ["default"]
+files = [
+    {file = "jiter-0.6.1-cp310-cp310-macosx_10_12_x86_64.whl", hash = "sha256:d08510593cb57296851080018006dfc394070178d238b767b1879dc1013b106c"},
+    {file = "jiter-0.6.1-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:adef59d5e2394ebbad13b7ed5e0306cceb1df92e2de688824232a91588e77aa7"},
+    {file = "jiter-0.6.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b3e02f7a27f2bcc15b7d455c9df05df8ffffcc596a2a541eeda9a3110326e7a3"},
+    {file = "jiter-0.6.1-cp310-cp310-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:ed69a7971d67b08f152c17c638f0e8c2aa207e9dd3a5fcd3cba294d39b5a8d2d"},
+    {file = "jiter-0.6.1-cp310-cp310-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b2019d966e98f7c6df24b3b8363998575f47d26471bfb14aade37630fae836a1"},
+    {file = "jiter-0.6.1-cp310-cp310-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:36c0b51a285b68311e207a76c385650322734c8717d16c2eb8af75c9d69506e7"},
+    {file = "jiter-0.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:220e0963b4fb507c525c8f58cde3da6b1be0bfddb7ffd6798fb8f2531226cdb1"},
+    {file = "jiter-0.6.1-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:aa25c7a9bf7875a141182b9c95aed487add635da01942ef7ca726e42a0c09058"},
+    {file = "jiter-0.6.1-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:e90552109ca8ccd07f47ca99c8a1509ced93920d271bb81780a973279974c5ab"},
+    {file = "jiter-0.6.1-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:67723a011964971864e0b484b0ecfee6a14de1533cff7ffd71189e92103b38a8"},
+    {file = "jiter-0.6.1-cp310-none-win32.whl", hash = "sha256:33af2b7d2bf310fdfec2da0177eab2fedab8679d1538d5b86a633ebfbbac4edd"},
+    {file = "jiter-0.6.1-cp310-none-win_amd64.whl", hash = "sha256:7cea41c4c673353799906d940eee8f2d8fd1d9561d734aa921ae0f75cb9732f4"},
+    {file = "jiter-0.6.1-cp311-cp311-macosx_10_12_x86_64.whl", hash = "sha256:b03c24e7da7e75b170c7b2b172d9c5e463aa4b5c95696a368d52c295b3f6847f"},
+    {file = "jiter-0.6.1-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:47fee1be677b25d0ef79d687e238dc6ac91a8e553e1a68d0839f38c69e0ee491"},
+    {file = "jiter-0.6.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:25f0d2f6e01a8a0fb0eab6d0e469058dab2be46ff3139ed2d1543475b5a1d8e7"},
+    {file = "jiter-0.6.1-cp311-cp311-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:0b809e39e342c346df454b29bfcc7bca3d957f5d7b60e33dae42b0e5ec13e027"},
+    {file = "jiter-0.6.1-cp311-cp311-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e9ac7c2f092f231f5620bef23ce2e530bd218fc046098747cc390b21b8738a7a"},
+    {file = "jiter-0.6.1-cp311-cp311-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:e51a2d80d5fe0ffb10ed2c82b6004458be4a3f2b9c7d09ed85baa2fbf033f54b"},
+    {file = "jiter-0.6.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3343d4706a2b7140e8bd49b6c8b0a82abf9194b3f0f5925a78fc69359f8fc33c"},
+    {file = "jiter-0.6.1-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:82521000d18c71e41c96960cb36e915a357bc83d63a8bed63154b89d95d05ad1"},
+    {file = "jiter-0.6.1-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:3c843e7c1633470708a3987e8ce617ee2979ee18542d6eb25ae92861af3f1d62"},
+    {file = "jiter-0.6.1-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:a2e861658c3fe849efc39b06ebb98d042e4a4c51a8d7d1c3ddc3b1ea091d0784"},
+    {file = "jiter-0.6.1-cp311-none-win32.whl", hash = "sha256:7d72fc86474862c9c6d1f87b921b70c362f2b7e8b2e3c798bb7d58e419a6bc0f"},
+    {file = "jiter-0.6.1-cp311-none-win_amd64.whl", hash = "sha256:3e36a320634f33a07794bb15b8da995dccb94f944d298c8cfe2bd99b1b8a574a"},
+    {file = "jiter-0.6.1-cp312-cp312-macosx_10_12_x86_64.whl", hash = "sha256:1fad93654d5a7dcce0809aff66e883c98e2618b86656aeb2129db2cd6f26f867"},
+    {file = "jiter-0.6.1-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:4e6e340e8cd92edab7f6a3a904dbbc8137e7f4b347c49a27da9814015cc0420c"},
+    {file = "jiter-0.6.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:691352e5653af84ed71763c3c427cff05e4d658c508172e01e9c956dfe004aba"},
+    {file = "jiter-0.6.1-cp312-cp312-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:defee3949313c1f5b55e18be45089970cdb936eb2a0063f5020c4185db1b63c9"},
+    {file = "jiter-0.6.1-cp312-cp312-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:26d2bdd5da097e624081c6b5d416d3ee73e5b13f1703bcdadbb1881f0caa1933"},
+    {file = "jiter-0.6.1-cp312-cp312-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:18aa9d1626b61c0734b973ed7088f8a3d690d0b7f5384a5270cd04f4d9f26c86"},
+    {file = "jiter-0.6.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7a3567c8228afa5ddcce950631c6b17397ed178003dc9ee7e567c4c4dcae9fa0"},
+    {file = "jiter-0.6.1-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:e5c0507131c922defe3f04c527d6838932fcdfd69facebafd7d3574fa3395314"},
+    {file = "jiter-0.6.1-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:540fcb224d7dc1bcf82f90f2ffb652df96f2851c031adca3c8741cb91877143b"},
+    {file = "jiter-0.6.1-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:e7b75436d4fa2032b2530ad989e4cb0ca74c655975e3ff49f91a1a3d7f4e1df2"},
+    {file = "jiter-0.6.1-cp312-none-win32.whl", hash = "sha256:883d2ced7c21bf06874fdeecab15014c1c6d82216765ca6deef08e335fa719e0"},
+    {file = "jiter-0.6.1-cp312-none-win_amd64.whl", hash = "sha256:91e63273563401aadc6c52cca64a7921c50b29372441adc104127b910e98a5b6"},
+    {file = "jiter-0.6.1.tar.gz", hash = "sha256:e19cd21221fc139fb032e4112986656cb2739e9fe6d84c13956ab30ccc7d4449"},
+]
 [[package]]
 name = "joblib"
 version = "1.4.2"
     {file = "nvidia_nvtx_cu12-12.4.127-py3-none-win_amd64.whl", hash = "sha256:641dccaaa1139f3ffb0d3164b4b84f9d253397e38246a4f2f36728b48566d485"},
 ]
+[[package]]
+name = "openai"
+version = "1.56.0"
+requires_python = ">=3.8"
+summary = "The official Python library for the openai API"
+groups = ["default"]
+dependencies = [
+    "anyio<5,>=3.5.0",
+    "distro<2,>=1.7.0",
+    "httpx<1,>=0.23.0",
+    "jiter<1,>=0.4.0",
+    "pydantic<3,>=1.9.0",
+    "sniffio",
+    "tqdm>4",
+    "typing-extensions<5,>=4.11",
+]
+files = [
+    {file = "openai-1.56.0-py3-none-any.whl", hash = "sha256:0751a6e139a09fca2e9cbbe8a62bfdab901b5865249d2555d005decf966ef9c3"},
+    {file = "openai-1.56.0.tar.gz", hash = "sha256:f7fa159c8e18e7f9a8d71ff4b8052452ae70a4edc6b76a6e97eda00d5364923f"},
+]
 [[package]]
 name = "orjson"
 version = "3.10.11"
     {file = "tblib-3.0.0.tar.gz", hash = "sha256:93622790a0a29e04f0346458face1e144dc4d32f493714c6c3dff82a4adb77e6"},
 ]
+[[package]]
+name = "tenacity"
+version = "9.0.0"
+requires_python = ">=3.8"
+summary = "Retry code until it succeeds"
+groups = ["default"]
+files = [
+    {file = "tenacity-9.0.0-py3-none-any.whl", hash = "sha256:93de0c98785b27fcf659856aa9f54bfbd399e29969b0621bc7f762bd441b4539"},
+    {file = "tenacity-9.0.0.tar.gz", hash = "sha256:807f37ca97d62aa361264d497b0e31e92b8027044942bfa756160d908320d73b"},
+]
 [[package]]
 name = "threadpoolctl"
 version = "3.5.0"

pyproject.toml CHANGED

@@ -1,7 +1,7 @@
 [project]
 name = "distilabel-dataset-generator"
 version = "0.1.0"
-description = "Default template for PDM package"
 authors = [
     {name = "davidberenstein1957", email = "[email protected]"},
 ]

 [project]
 name = "distilabel-dataset-generator"
 version = "0.1.0"
+description = "Build datasets using natural language"
 authors = [
     {name = "davidberenstein1957", email = "[email protected]"},
 ]

src/distilabel_dataset_generator/__init__.py CHANGED

@@ -1,6 +1,9 @@
 from pathlib import Path
 from typing import Optional, Union
 import distilabel
 import distilabel.distiset
 from distilabel.utils.card.dataset_card import (
@@ -9,6 +12,29 @@ from distilabel.utils.card.dataset_card import (
 )
 from huggingface_hub import DatasetCardData, HfApi, upload_file
 class CustomDistisetWithAdditionalTag(distilabel.distiset.Distiset):
     def _generate_card(

+import os
+import warnings
 from pathlib import Path
 from typing import Optional, Union
+import argilla as rg
 import distilabel
 import distilabel.distiset
 from distilabel.utils.card.dataset_card import (
 )
 from huggingface_hub import DatasetCardData, HfApi, upload_file
+HF_TOKENS = [os.getenv("HF_TOKEN")] + [os.getenv(f"HF_TOKEN_{i}") for i in range(1, 10)]
+HF_TOKENS = [token for token in HF_TOKENS if token]
+if len(HF_TOKENS) == 0:
+    raise ValueError(
+        "HF_TOKEN is not set. Ensure you have set the HF_TOKEN environment variable that has access to the Hugging Face Hub repositories and Inference Endpoints."
+    )
+ARGILLA_API_URL = os.getenv("ARGILLA_API_URL")
+ARGILLA_API_KEY = os.getenv("ARGILLA_API_KEY")
+if ARGILLA_API_URL is None or ARGILLA_API_KEY is None:
+    ARGILLA_API_URL = os.getenv("ARGILLA_API_URL_SDG_REVIEWER")
+    ARGILLA_API_KEY = os.getenv("ARGILLA_API_KEY_SDG_REVIEWER")
+if ARGILLA_API_URL is None or ARGILLA_API_KEY is None:
+    warnings.warn("ARGILLA_API_URL or ARGILLA_API_KEY is not set")
+    argilla_client = None
+else:
+    argilla_client = rg.Argilla(
+        api_url=ARGILLA_API_URL,
+        api_key=ARGILLA_API_KEY,
+    )
 class CustomDistisetWithAdditionalTag(distilabel.distiset.Distiset):
     def _generate_card(
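
Note on the `__init__.py` change above: the added module-level code collects `HF_TOKEN` plus `HF_TOKEN_1` through `HF_TOKEN_9`, fails fast when none is set, and falls back from `ARGILLA_API_URL`/`ARGILLA_API_KEY` to the `*_SDG_REVIEWER` variables before giving up on an Argilla client. The sketch below restates that lookup order as two pure functions so it can be tried without touching the real environment. The function names `collect_hf_tokens` and `resolve_argilla_settings` are hypothetical and do not appear in the commit, and the second function returns `(None, None)` in the case where the commit would set `argilla_client = None`.

```python
import warnings
from typing import List, Mapping, Optional, Tuple


def collect_hf_tokens(env: Mapping[str, str]) -> List[str]:
    """Gather HF_TOKEN and HF_TOKEN_1..HF_TOKEN_9, skipping unset or empty values."""
    names = ["HF_TOKEN"] + [f"HF_TOKEN_{i}" for i in range(1, 10)]
    tokens = [env.get(name) for name in names]
    tokens = [token for token in tokens if token]
    if len(tokens) == 0:
        # Mirrors the commit: missing tokens are a hard error at import time.
        raise ValueError("HF_TOKEN is not set.")
    return tokens


def resolve_argilla_settings(
    env: Mapping[str, str],
) -> Tuple[Optional[str], Optional[str]]:
    """Return (api_url, api_key), or (None, None) when no client can be built."""
    api_url = env.get("ARGILLA_API_URL")
    api_key = env.get("ARGILLA_API_KEY")
    if api_url is None or api_key is None:
        # Fallback pair used by the commit when the primary pair is incomplete.
        api_url = env.get("ARGILLA_API_URL_SDG_REVIEWER")
        api_key = env.get("ARGILLA_API_KEY_SDG_REVIEWER")
    if api_url is None or api_key is None:
        # Missing Argilla settings only warn; they do not stop the app.
        warnings.warn("ARGILLA_API_URL or ARGILLA_API_KEY is not set")
        return None, None
    return api_url, api_key
```

Passing `os.environ` as `env` would reproduce the lookup against the real environment. Note the asymmetry the commit introduces: a missing Hugging Face token raises, while missing Argilla settings only produce a warning.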

src/distilabel_dataset_generator/apps/base.py CHANGED

@@ -195,7 +195,7 @@ def validate_argilla_user_workspace_dataset(
     return ""
-def get_org_dropdown(oauth_token: OAuthToken = None):
     orgs = list_orgs(oauth_token)
     return gr.Dropdown(
         label="Organization",
@@ -488,7 +488,7 @@ def show_success_message(org_name, repo_name) -> gr.Markdown:
                 </strong>
             </p>
             <p style="margin-top: 0.5em;">
-                The generated dataset is in the right format for fine-tuning with TRL, AutoTrain, or other frameworks. Your dataset is now available at:
                 <a href="https://huggingface.co/datasets/{org_name}/{repo_name}" target="_blank" style="color: #1565c0; text-decoration: none;">
                     https://huggingface.co/datasets/{org_name}/{repo_name}
                 </a>
@@ -503,5 +503,6 @@ def show_success_message(org_name, repo_name) -> gr.Markdown:
         visible=True,
     )
 def hide_success_message() -> gr.Markdown:
     return gr.Markdown(value="")

     return ""
+def get_org_dropdown(oauth_token: Union[OAuthToken, None]):
     orgs = list_orgs(oauth_token)
     return gr.Dropdown(
         label="Organization",
                 </strong>
             </p>
             <p style="margin-top: 0.5em;">
+                The generated dataset is in the right format for fine-tuning with TRL, AutoTrain, or other frameworks. Your dataset is now available at:
                 <a href="https://huggingface.co/datasets/{org_name}/{repo_name}" target="_blank" style="color: #1565c0; text-decoration: none;">
                     https://huggingface.co/datasets/{org_name}/{repo_name}
                 </a>
         visible=True,
     )
 def hide_success_message() -> gr.Markdown:
     return gr.Markdown(value="")

src/distilabel_dataset_generator/pipelines/base.py CHANGED

@@ -1,4 +1,4 @@
-from src.distilabel_dataset_generator.utils import HF_TOKENS
 DEFAULT_BATCH_SIZE = 5
 TOKEN_INDEX = 0

+from src.distilabel_dataset_generator import HF_TOKENS
 DEFAULT_BATCH_SIZE = 5
 TOKEN_INDEX = 0
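
Note on `pipelines/base.py`: this hunk only moves the `HF_TOKENS` import and does not show how `TOKEN_INDEX` is used. One plausible reading, assumed here and not confirmed by the diff, is that the index cycles through the available tokens so that requests are spread across them. A minimal sketch of that round-robin idea, with a hypothetical `TokenRotator` class that is not part of the commit:

```python
from typing import List, Sequence


class TokenRotator:
    """Hand out tokens in round-robin order (illustrative helper, not from the commit)."""

    def __init__(self, tokens: Sequence[str]) -> None:
        if not tokens:
            raise ValueError("at least one token is required")
        self._tokens: List[str] = list(tokens)
        self._index = 0  # plays the role the diff's TOKEN_INDEX might play

    def next(self) -> str:
        # Wrap around with modulo so the index can grow without bound.
        token = self._tokens[self._index % len(self._tokens)]
        self._index += 1
        return token
```

With two tokens, three successive calls to `next()` return the first, the second, then the first again.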

src/distilabel_dataset_generator/utils.py CHANGED

@@ -1,5 +1,4 @@
 import json
-import os
 from typing import List, Optional, Union
 import argilla as rg
@@ -16,10 +15,10 @@ from gradio.oauth import (
 from huggingface_hub import whoami
 from jinja2 import Environment, meta
 _LOGGED_OUT_CSS = ".main_ui_logged_out{opacity: 0.3; pointer-events: none}"
-HF_TOKENS = [os.getenv("HF_TOKEN")] + [os.getenv(f"HF_TOKEN_{i}") for i in range(1, 10)]
-HF_TOKENS = [token for token in HF_TOKENS if token]
 _CHECK_IF_SPACE_IS_SET = (
     all(
@@ -48,7 +47,7 @@ def get_duplicate_button():
         return gr.DuplicateButton(size="lg")
-def list_orgs(oauth_token: OAuthToken = None):
     try:
         if oauth_token is None:
             return []
@@ -72,7 +71,7 @@ def list_orgs(oauth_token: OAuthToken = None):
     return organizations
-def get_org_dropdown(oauth_token: OAuthToken = None):
     if oauth_token is not None:
         orgs = list_orgs(oauth_token)
     else:
@@ -86,14 +85,14 @@ def get_org_dropdown(oauth_token: OAuthToken = None):
     )
-def get_token(oauth_token: OAuthToken = None):
     if oauth_token:
         return oauth_token.token
     else:
         return ""
-def swap_visibility(oauth_token: Optional[OAuthToken] = None):
     if oauth_token:
         return gr.update(elem_classes=["main_ui_logged_in"])
     else:
@@ -123,18 +122,8 @@ def get_base_app():
 def get_argilla_client() -> Union[rg.Argilla, None]:
-    try:
-        api_url = os.getenv("ARGILLA_API_URL_SDG_REVIEWER")
-        api_key = os.getenv("ARGILLA_API_KEY_SDG_REVIEWER")
-        if api_url is None or api_key is None:
-            api_url = os.getenv("ARGILLA_API_URL")
-            api_key = os.getenv("ARGILLA_API_KEY")
-        return rg.Argilla(
-            api_url=api_url,
-            api_key=api_key,
-        )
-    except Exception:
-        return None
 def get_preprocess_labels(labels: Optional[List[str]]) -> List[str]:
     return list(set([label.lower().strip() for label in labels])) if labels else []

 import json
 from typing import List, Optional, Union
 import argilla as rg
 from huggingface_hub import whoami
 from jinja2 import Environment, meta
+from src.distilabel_dataset_generator import argilla_client
 _LOGGED_OUT_CSS = ".main_ui_logged_out{opacity: 0.3; pointer-events: none}"
 _CHECK_IF_SPACE_IS_SET = (
     all(
         return gr.DuplicateButton(size="lg")
+def list_orgs(oauth_token: Union[OAuthToken, None] = None):
     try:
         if oauth_token is None:
             return []
     return organizations
+def get_org_dropdown(oauth_token: Union[OAuthToken, None] = None):
     if oauth_token is not None:
         orgs = list_orgs(oauth_token)
     else:
     )
+def get_token(oauth_token: Union[OAuthToken, None]):
     if oauth_token:
         return oauth_token.token
     else:
         return ""
+def swap_visibility(oauth_token: Union[OAuthToken, None]):
     if oauth_token:
         return gr.update(elem_classes=["main_ui_logged_in"])
     else:
 def get_argilla_client() -> Union[rg.Argilla, None]:
+    return argilla_client
 def get_preprocess_labels(labels: Optional[List[str]]) -> List[str]:
     return list(set([label.lower().strip() for label in labels])) if labels else []
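
Note on the `utils.py` helpers above: `get_token` and `get_preprocess_labels` are small enough to exercise in isolation. The sketch below copies their bodies from the diff and substitutes a stand-in `FakeOAuthToken` dataclass for Gradio's `OAuthToken`; the stand-in is an assumption made only so the example runs without Gradio installed.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class FakeOAuthToken:
    """Stand-in for gradio's OAuthToken; only the `token` attribute is modelled."""

    token: str


def get_token(oauth_token: Union[FakeOAuthToken, None]) -> str:
    # Logged-out users (no OAuth token) yield an empty string.
    if oauth_token:
        return oauth_token.token
    else:
        return ""


def get_preprocess_labels(labels: Optional[List[str]]) -> List[str]:
    # Lowercase and strip each label, then deduplicate through a set.
    return list(set([label.lower().strip() for label in labels])) if labels else []
```

Because the labels pass through a `set`, the order of the returned list is not guaranteed; callers that need a stable order would have to sort the result themselves.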