Ben Gowan's picture
2 2

Ben Gowan

bgowan
·

AI & ML interests

None yet

Recent Activity

Organizations

None yet

bgowan's activity

reacted to alielfilali01's post with 👍 3 months ago
view post
Post
2575
Don't you think we should add a tag "Evaluation" for datasets that are meant to be benchmarks and not for training ?

At least, when someone is collecting a group of datasets from an organization or let's say the whole hub can filter based on that tag and avoid somehow contaminating their "training" data.
reacted to mehd-io's post with ❤️ 12 months ago
reacted to Pclanglais's post with ❤️ 12 months ago
view post
Post
Hi everyone,
For my first post, I'm announcing a big release (in multiple ways): probably the largest open corpus in French to date, with 85 billion words in the public domain.
The dataset has been prepared in collaboration with Benoît de Courson and Benjamin Azoulay from Gallicagram (https://shiny.ens-paris-saclay.fr/app/gallicagram). Gallicagram is a major cultural analytics project in French, the open and better version of ngram viewer for large scale search of word and ngram occurrences.
The corpus is made of two different dataset for monographs (16B words) PleIAs/French-PD-Newspapers and newspapers/periodicals (69B) PleIAs/French-PD-Newspapers Along with the full text it also includes core provenance metadata.
Beyond research in digital humanities, the corpus can also be used to training open and reproducible LLMs. Being in the public domain means it can be released everywhere in any shape without restrictions.
The corpus is not perfect: digitization of cultural heritage is challenging and, especially for newspapers, we tackle with layout issues and a significant rate of optical character recognition mistake. Our conviction is that releasing corpus as a commons is the best way to improve on this. Sharing is caring.
  • 1 reply
·