Article Releasing Common Corpus: the largest public domain dataset for training LLMs By Pclanglais • Mar 20, 2024 • 18
BrainGPT/train_valid_split_pmc_neuroscience_2002-2022_filtered_subset Viewer • Updated Mar 2, 2024 • 445k • 39 • 7
PDF Document / OCR Datasets Collection • Document datasets with .pdf files that are usable with pixparse libraries and tools. • 2 items • Updated Mar 30, 2024 • 47
GeorgiaTechResearchInstitute/galactica-30b-evol-instruct-70k Text Generation • Updated Jun 27, 2023 • 7.99k • 23