How do your annotations for FineWeb2 compare to your teammates'?
I started contributing annotations to the FineWeb2 collaborative annotation sprint, and I wanted to know whether my labelling trends were similar to those of my teammates.
I did some analysis and wasn't surprised to see that I'm being a bit harsher in my evaluations than my teammates.
Do you want to see how your annotations compare to others'? Go to this Gradio space: nataliaElv/fineweb2_compare_my_annotations. Enter the dataset you've contributed to and your Hugging Face username.
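If you'd rather run a comparison like this locally, here is a minimal sketch of the same idea in Python. The dataset ID and the `annotator`/`label` column names are assumptions for illustration, not the space's actual schema; adapt them to however your annotations are exported.

```python
# A minimal sketch: compare your label distribution against your teammates'.
# Assumptions: the annotations are available as a Hugging Face dataset with
# hypothetical "annotator" and "label" columns.
from collections import Counter

from datasets import load_dataset

USERNAME = "your-hf-username"               # assumption: your Hub username
DATASET = "your-org/fineweb2-annotations"   # assumption: exported annotations

ds = load_dataset(DATASET, split="train")

mine = Counter(row["label"] for row in ds if row["annotator"] == USERNAME)
others = Counter(row["label"] for row in ds if row["annotator"] != USERNAME)

def share(counter):
    """Normalize raw counts into per-label proportions."""
    total = sum(counter.values())
    return {label: count / total for label, count in counter.items()}

print("my label shares:        ", share(mine))
print("teammates' label shares:", share(others))
```

Comparing proportions rather than raw counts is what makes the "am I harsher?" question answerable even when annotators labelled different numbers of examples.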
First Global and Dense Open Embedding Dataset of Earth!
Introducing the Major TOM embeddings dataset, created in collaboration with CloudFerro S.A. and Φ-lab at the European Space Agency (ESA). Together with @mikonvergence and Jędrzej S. Bojanowski, we present the first open-access dataset of Copernicus embeddings, offering dense, global coverage across the full acquisition areas of the Sentinel-1 and Sentinel-2 sensors.
Highlights:
- Data: over 8 million Sentinel-1 & Sentinel-2 images processed, distilling insights from 9.368 trillion pixels of raw data.
- Models: foundation models include SigLIP, DINOv2, and SSL4EO.
- Scale: 62 TB of raw satellite data processed into 170M+ embeddings.
This project delivers open and free vectorized expansions of the Major TOM datasets (Major-TOM/README), setting a new standard for embedding releases and enabling lightweight, scalable ingestion of Earth Observation (EO) data for countless applications.
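As an illustration of what "lightweight ingestion" can look like, here is a hedged Python sketch that streams a sample of embeddings and runs a toy cosine-similarity search. The dataset ID and the `embedding` column name are assumptions; check the Major-TOM organization on the Hub for the actual dataset names and schema.

```python
# A minimal sketch, assuming a hypothetical embedding dataset ID and an
# "embedding" column; verify both against the Major-TOM org on the Hub.
import numpy as np
from datasets import load_dataset

# Stream to avoid downloading the full release.
ds = load_dataset("Major-TOM/Core-S2L2A-SSL4EO", split="train", streaming=True)

# Take a small sample of embeddings for a toy similarity search.
rows = list(ds.take(1000))
embs = np.array([row["embedding"] for row in rows], dtype=np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalize rows

query = embs[0]                  # use the first sample as the query vector
scores = embs @ query            # cosine similarity against the sample
top = np.argsort(-scores)[:5]    # indices of the 5 most similar samples
print("most similar sample indices:", top)
```

Because the embeddings ship as plain vectors, the same few lines generalize to clustering, deduplication, or feeding a vector database, with no raster processing required.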
We're so close to reaching 100 languages! Can you help us cover the remaining 200? Check if we're still looking for language leads for your language: nataliaElv/language-leads-dashboard
Would you like to get a high-quality dataset to pre-train LLMs in your language?
At Hugging Face we're preparing a collaborative annotation effort to build an open-source multilingual dataset as part of the Data is Better Together initiative.
Follow the link below, check whether your language is listed, and sign up to be a Language Lead!