💬 Query datasets with SQL through chat! The Airtrain AI team is happy to share a new Hugging Face Space that lets you interact with Hugging Face Hub datasets using a natural language chatbot. 🤗
This Space is forked from davidberenstein1957/text-to-sql-hub-datasets by @davidberenstein1957 and features chat capability with improved table naming. The tool works with Hugging Face’s recently released in-browser DuckDB-based SQL query engine for datasets.
Introducing Fineweb-Edu-Fortified: An enhanced Fineweb-Edu dataset. 📚
This dataset is tailored for NLP tasks and helps streamline model training by offering a more refined, deduplicated dataset. Perfect for startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the large Fineweb dataset and includes:
- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- Count column indicating duplication frequency
- Data from 95 Common Crawl crawls (2013-2024)
- Rows reduced from 1.279B to 0.324B after deduplication
- ~375B tokens (down from 1,320B in Fineweb-Edu)
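For a rough feel of what exact-match deduplication with a count column means, here is a minimal Python sketch. It is not the pipeline used to build the dataset: the function name `dedupe_exact`, the `text` and `count` field names, and the toy rows are all assumptions made for illustration.

```python
from collections import Counter


def dedupe_exact(texts):
    """Collapse identical strings and record how often each one occurred."""
    # Counter is a dict subclass, so keys keep first-seen order (Python 3.7+).
    counts = Counter(texts)
    return [{"text": text, "count": n} for text, n in counts.items()]


# Hypothetical toy rows standing in for documents seen across several crawls.
rows = [
    "Photosynthesis converts light into chemical energy.",
    "The French Revolution began in 1789.",
    "Photosynthesis converts light into chemical energy.",
]

for row in dedupe_exact(rows):
    print(row["count"], row["text"])
```

Each distinct text is kept once, and the count records how many times it appeared in the input. At the scale of billions of rows one would more likely hash each text and aggregate in a distributed engine, but the idea is the same.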
Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication. 🤗