Data Is Better Together

community

Activity Feed

AI & ML interests

Building better datasets together

Recent Activity

data-is-better-together's activity

davanstrien

updated a dataset about 4 hours ago

data-is-better-together/fineweb-c-progress

Viewer • Updated about 4 hours ago • 702 • 674 • 2

davanstrien

updated a dataset about 14 hours ago

data-is-better-together/fineweb-c

Viewer • Updated about 14 hours ago • 31k • 625 • 31

davanstrien

in data-is-better-together/fineweb-c 4 days ago

remove duplicate config names

#6 opened 4 days ago by

davanstrien

remove dan config duplicate

#5 opened 4 days ago by

davanstrien

fix configs

#4 opened 4 days ago by

davanstrien

davidberenstein1957

posted an update 5 days ago

Post

1840

Fine-tuning ModernBERT for text classification using synthetic data generation

From prompt to model in 3 steps:
- 1 dataset description
- 20 minutes of generating
- 60 minutes of fine-tuning on my MacBook Pro

Tutorial: https://nbsanity.com/static/552eb50cbd91bedb4e5b73fddca2664a/fine-tune-modernbert-classifier.html

davanstrien

posted an update 8 days ago

Post

2944

🇸🇰 Hovorte po slovensky? Help build better AI for Slovak!

We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release!

Your contribution will help create better language models for 5+ million Slovak speakers.

Annotate here: data-is-better-together/fineweb-c.

Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community

3 replies

sayakpaul

posted an update 11 days ago

Post

3746

Commits speak louder than words 🤪

* 4 new video models
* Multiple image models, including SANA & Flux Control
* New quantizers -> GGUF & TorchAO
* New training scripts

Enjoy this holiday-special Diffusers release 🤗
Notes: https://github.com/huggingface/diffusers/releases/tag/v0.32.0

davanstrien

posted an update 14 days ago

Post

1671

Introducing FineWeb-C 🌐🎓, a community-built dataset for improving language models in ALL languages.

Inspired by FineWeb-Edu, the community is labelling the educational quality of texts in many languages.

318 annotators, 32K+ annotations, 12 languages - and growing! 🌍

data-is-better-together/fineweb-c

burtenshaw

posted an update 16 days ago

Post

2620

People are flexing their end of year stats, so I made this app to show hub stats in a tidy design!

Thanks @Ameeeee and @jfcalvo for the feature from Argilla!
burtenshaw/recap

1 reply

davidberenstein1957

posted an update 16 days ago

Post

1340

🐇 Tumble down the AI rabbit hole without any technical knowledge!

Explore AI models on the Hub with a simple, quick search.

Demo: davidberenstein1957/transformers-pipeline-playground

sayakpaul

posted an update 17 days ago

Post

1713

In the past seven days, the Diffusers team has shipped:

1. Two new video models
2. One new image model
3. Two new quantization backends
4. Three new fine-tuning scripts
5. Multiple fixes and library QoL improvements

Coffee on me if someone can guess 1 - 4 correctly.

1 reply

nataliaElv

posted an update 18 days ago

Post

1639

If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU

davidberenstein1957

posted an update 19 days ago

Post

4162

Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process makes dataset creation a non-technical breeze, so anyone can create datasets and models in minutes without writing any code.

Blog: https://huggingface.co/blog/synthetic-data-generator
Space: argilla/synthetic-data-generator

4 replies

nataliaElv

posted an update 24 days ago

Post

1265

How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my mates 😂

Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket chat: HuggingFaceFW/discussion
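
The comparison described in this post can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the Gradio space: the label names and their numeric scores, the `(record_id, annotator, label)` tuple layout, the `harshness_gap` helper, and the toy data are all assumptions made for the example.

```python
from statistics import mean

# Assumed ordinal scale for educational-quality labels (hypothetical mapping;
# the real dataset's label names and ordering may differ).
LABEL_SCORES = {"None": 0, "Minimal": 1, "Basic": 2, "Good": 3, "Excellent": 4}


def harshness_gap(annotations, username):
    """Return my mean score minus the other annotators' mean score.

    Only records annotated both by `username` and by at least one other
    annotator are compared. A negative result means `username` rates texts
    lower (harsher) than the others. Returns None if nothing is shared.
    """
    mine = {}
    others = {}
    for record_id, annotator, label in annotations:
        score = LABEL_SCORES[label]
        if annotator == username:
            mine[record_id] = score
        else:
            others.setdefault(record_id, []).append(score)

    shared = [record_id for record_id in mine if record_id in others]
    if not shared:
        return None

    my_mean = mean(mine[record_id] for record_id in shared)
    others_mean = mean(mean(others[record_id]) for record_id in shared)
    return my_mean - others_mean


# Toy data, invented purely for illustration.
toy_annotations = [
    ("doc1", "alice", "Basic"),
    ("doc1", "bob", "Good"),
    ("doc2", "alice", "None"),
    ("doc2", "bob", "Minimal"),
    ("doc2", "carol", "Basic"),
    ("doc3", "bob", "Good"),
]

print(harshness_gap(toy_annotations, "alice"))
```

Averaging the other annotators per record first, then across records, keeps a heavily annotated record from dominating the comparison.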

burtenshaw

posted an update 25 days ago

Post

2405

Quick update from week 1 of smol course. The community is taking the driving seat and using the material for their own projects. If you want to do the same, join in!

- We have ongoing translation projects in Korean, Vietnamese, Portuguese, and Spanish.
- 3 chapters are ready for students, on topics like instruction tuning, preference alignment, and parameter-efficient fine-tuning.
- 3 chapters are in progress, on evaluation, vision language models, and synthetic data.
- Around 780 people have forked the repo to use it for learning, teaching, and sharing.

⏭️ The next step is to support people who want to use the course for teaching, content creation, internal knowledge sharing, or anything else. If you're into this, drop an issue or PR.

Repo: https://buff.ly/3ZCMKX2
Discord channel: https://buff.ly/4f9F8jA

sayakpaul

posted an update 25 days ago

Post

2069

Introducing a high-quality open-preference dataset to further this line of research for image generation.

Despite being such an inseparable component of modern image generation, open preference datasets are a rarity!

So, we decided to work on one with the community!

Check it out here:
https://huggingface.co/blog/image-preferences

7 replies

davidberenstein1957

posted an update 26 days ago

Post

2063

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

Open Image Preferences is an Apache 2.0 licensed dataset for text-to-image generation. This dataset contains 10K text-to-image preference pairs across common image generation categories, while using different model families and varying prompt complexities.

https://huggingface.co/blog/image-preferences

sayakpaul

posted an update 26 days ago

Post

2114

The Control family of Flux from @black-forest-labs should be discussed more!

It enables structural controls like ControlNets while being significantly less expensive to run!

So, we're working on a Control LoRA training script 🤗

It's still WIP, so go easy:
https://github.com/huggingface/diffusers/pull/10130

sayakpaul

authored a paper 28 days ago

A Noise is Worth Diffusion Guidance

Paper • 2412.03895 • Published 30 days ago • 28

Team members 15
