The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 88
view article Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • Jul 9, 2024 • 41
GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks Paper • 2406.12925 • Published Jun 14, 2024 • 23
Beyond Document Page Classification: Design, Datasets, and Challenges Paper • 2308.12896 • Published Aug 24, 2023 • 1