I thought big and complex repos would be fun to visualize, and they can be! This image is from blanchon/RESISC45, a repo with 31,500 images from Google Earth, each bucketed into one of 45 scene classes with 700 images per class:
But more fun is when you find a repository that is structured (naming conventions and directories) in a way that lets you see the inequity in the bytes.
This is most apparent in multilingual NLP datasets, similar to the wikimedia/wikipedia dataset. If you zoom in on any of these (or run them yourself in the Space), you'll see a directory or file naming convention that uses the language abbreviation. The closer a directory or file is to yellow, the more bytes are devoted to that language.
Here's facebook/multilingual_librispeech:
and mozilla-foundation/common_voice_17_0:
and google/xtreme:
and uonlp/CulturaX:
Each dataset shows some imbalance in the languages represented, and this pattern holds true for other types of datasets as well. However, such discrepancies can be harder to spot when folder or file naming conventions prioritize machine over human readability.
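If you'd rather see the imbalance as numbers than as colors, here is a minimal sketch of the idea, not how the Space computes anything. It assumes you already have a list of `(path, size_in_bytes)` pairs for a repo and that the top-level directory name is the language code. The function names and the example paths and sizes are made up for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def bytes_by_top_level_dir(files):
    """Sum file sizes (in bytes) under each top-level directory."""
    totals = defaultdict(int)
    for path, size in files:
        parts = PurePosixPath(path).parts
        # Files at the repo root have no directory; group them under ".".
        key = parts[0] if len(parts) > 1 else "."
        totals[key] += size
    return dict(totals)


def shares(totals):
    """Convert byte totals into fractions of the whole, largest first."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, size / grand_total) for name, size in ranked]


# Hypothetical listing: made-up paths and sizes, not taken from any real repo.
example_files = [
    ("en/train-00000.parquet", 600),
    ("en/train-00001.parquet", 200),
    ("de/train-00000.parquet", 150),
    ("sw/train-00000.parquet", 50),
    ("README.md", 0),
]

for name, share in shares(bytes_by_top_level_dir(example_files)):
    print(f"{name}: {share:.1%}")
```

This only works when the language code is the first path component. Repos that encode the language in the file name, or nest it deeper, would need a different grouping key.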
Another fun example is the nguha/legalbench dataset, designed to evaluate legal reasoning in LLMs. It provides a clear view of the types of reasoning being tested:
You might have to squint to see the labels, though. This is one where it might be best to head over to the Space https://huggingface.co/spaces/jsulz/repo-info and see it for yourself ;)