Don't you think we should add an "Evaluation" tag for datasets that are meant to be benchmarks rather than training data?

At the very least, someone collecting a group of datasets from an organization, or even from the whole Hub, could filter on that tag and avoid contaminating their "training" data.
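
To make the idea concrete, here is a minimal sketch of the filtering step. It assumes a hypothetical `evaluation` tag (no such convention is confirmed here) and uses a made-up `DatasetInfo` record in place of real Hub metadata, so treat the names as illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetInfo:
    """Hypothetical stand-in for a dataset's metadata (not a real Hub class)."""
    repo_id: str
    tags: list[str] = field(default_factory=list)


# Assumed tag name for benchmark datasets; purely illustrative.
EVAL_TAG = "evaluation"


def split_by_eval_tag(datasets):
    """Split datasets into (training candidates, benchmarks) using the assumed tag."""
    train, benchmarks = [], []
    for ds in datasets:
        # Compare case-insensitively so "Evaluation" and "evaluation" both match.
        tags = {t.lower() for t in ds.tags}
        if EVAL_TAG in tags:
            benchmarks.append(ds)
        else:
            train.append(ds)
    return train, benchmarks


# Made-up catalog entries, used only to exercise the function.
catalog = [
    DatasetInfo("org/web-corpus", ["text"]),
    DatasetInfo("org/qa-benchmark", ["text", "Evaluation"]),
    DatasetInfo("org/untagged-corpus"),
]

train, benchmarks = split_by_eval_tag(catalog)
print("train:", [d.repo_id for d in train])
print("benchmarks:", [d.repo_id for d in benchmarks])
```

Note that this only helps if benchmark authors actually apply the tag; untagged benchmarks would still land in the training list.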

reacted to mehd-io's post with ❤️ 12 months ago


We just released the first Text2SQL model for DuckDB 🦆🧠
You can try it out directly here:
motherduckdb/DuckDB-NSQL-7B


reacted to Pclanglais's post with ❤️ 12 months ago


Hi everyone,
For my first post, I'm announcing a big release (in multiple ways): probably the largest open corpus in French to date, with 85 billion words in the public domain.
The dataset has been prepared in collaboration with Benoît de Courson and Benjamin Azoulay from Gallicagram (https://shiny.ens-paris-saclay.fr/app/gallicagram). Gallicagram is a major cultural analytics project in French: an open and improved alternative to Ngram Viewer for large-scale search of word and n-gram occurrences.
The corpus is made up of two different datasets: monographs (16B words) PleIAs/French-PD-Newspapers and newspapers/periodicals (69B words) PleIAs/French-PD-Newspapers. Along with the full text, it also includes core provenance metadata.
Beyond research in digital humanities, the corpus can also be used to train open and reproducible LLMs. Being in the public domain means it can be released anywhere, in any form, without restrictions.
The corpus is not perfect: digitization of cultural heritage is challenging and, especially for newspapers, we contend with layout issues and a significant rate of optical character recognition errors. Our conviction is that releasing the corpus as a commons is the best way to improve on this. Sharing is caring.


liked a model 12 months ago

NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO

Text Generation • Updated Apr 30, 2024 • 3.35k • 421

liked a model about 1 year ago

CyberPeace-Institute/Cybersecurity-Knowledge-Graph

Token Classification • Updated Jan 24, 2024 • 273 • 18