Hugging Face Discord Community


AI & ML interests

Collaborating towards Good ML!

Recent Activity

discord-community's activity

not-lain
updated a Space about 16 hours ago
nataliaElv
posted an update 18 days ago
If you are still wondering how the FineWeb2 annotations are done, how to follow the guidelines or how Argilla works, this is your video!

I go through a few samples of the FineWeb2 dataset and classify them based on their educational content. Check it out!

https://www.youtube.com/watch?v=_-ORB4WAVGU
nataliaElv
posted an update 24 days ago
How do your annotations for FineWeb2 compare to your teammates'?

I started contributing some annotations to the FineWeb2 collaborative annotation sprint and I wanted to know if my labelling trends were similar to those of my teammates.

I did some analysis and I wasn't surprised to see that I'm being a bit harsher in my evaluations than my teammates 😂


Do you want to see how your annotations compare to others?
👉 Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations
✍️ Enter the dataset that you've contributed to and your Hugging Face username.

How were your results?
- Contribute some annotations: data-is-better-together/fineweb-c
- Join your language channel in Rocket chat: HuggingFaceFW/discussion
mkluczek
posted an update 25 days ago
First Global and Dense Open Embedding Dataset of Earth! 🌍 🤗

Introducing the Major TOM embeddings dataset, created in collaboration with CloudFerro S.A. 🔶 and Φ-lab at the European Space Agency (ESA) 🛰️. Together with @mikonvergence and Jędrzej S. Bojanowski, we present the first open-access dataset of Copernicus embeddings, offering dense, global coverage across the full acquisition areas of the Sentinel-1 and Sentinel-2 sensors.

💡 Highlights:
📊 Data: Over 8 million Sentinel-1 & Sentinel-2 images processed, distilling insights from 9.368 trillion pixels of raw data.
🧠 Models: Foundation models include SigLIP, DINOv2, and SSL4EO.
📦 Scale: 62 TB of raw satellite data processed into 170M+ embeddings.

This project delivers open and free vectorized expansions of Major-TOM/README datasets, setting a new standard for embedding releases and enabling lightweight, scalable ingestion of Earth Observation (EO) data for countless applications.

🤗 Explore the datasets:
Major-TOM/Core-S2L1C-SSL4EO
Major-TOM/Core-S1RTC-SSL4EO
Major-TOM/Core-S2RGB-DINOv2
Major-TOM/Core-S2RGB-SigLIP

📖 Check the paper: Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space (2412.05600)
💻 Code notebook: https://github.com/ESA-PhiLab/Major-TOM/blob/main/05-Generate-Major-TOM-Embeddings.ipynb
nataliaElv
posted an update about 1 month ago
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
nroggendorff
in discord-community/LevelBot about 1 month ago

Suggestion Discussion

#25 opened 2 months ago by nroggendorff
lunarflu
updated a Space about 1 month ago
nataliaElv
posted an update about 1 month ago
Would you like to get a high-quality dataset to pre-train LLMs in your language? 🌍

At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.

Follow the link below, check if your language is listed and sign up to be a Language Lead!

https://forms.gle/s9nGajBh6Pb9G72J6