SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction
Abstract
Table extraction from document images is a challenging AI problem, and labelled data for many content domains is difficult to come by. Existing table extraction datasets often focus on scientific tables because of the large number of academic articles that are readily available, along with their source code. However, there are significant layout and typographical differences between tables found across scientific, financial, and other domains. Current datasets often lack the words contained within the tables, and their positions, and instead rely on unreliable OCR to extract these features for training modern machine learning models on natural language processing tasks. Therefore, there is a need for a more general method of obtaining labelled data. We present SynFinTabs, a large-scale, labelled dataset of synthetic financial tables. Our hope is that our method of generating these synthetic tables is transferable to other domains. To demonstrate the effectiveness of our dataset in training models to extract information from table images, we create FinTabQA, a layout large language model trained on an extractive question-answering task. We test our model on real-world financial tables, compare it to a state-of-the-art generative model, and discuss the results. We make the dataset, model, and dataset generation code publicly available.
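To make the idea of extractive question answering over labelled words and positions concrete, here is a minimal sketch in Python. It is not the SynFinTabs schema or the FinTabQA implementation: the `Word` class, the pixel bounding-box layout, the `extract_answer` helper, and the sample table values are all assumptions made up for illustration. The sketch only shows why having words and their positions matters: an extractive model predicts a start and end word index, and the answer text and its location on the page follow directly from the labelled words.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) in pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Word:
    """One labelled word from a table image (illustrative structure only)."""
    text: str
    bbox: BBox


def extract_answer(words: List[Word], start: int, end: int) -> Tuple[str, BBox]:
    """Return the answer text and its enclosing box for the
    inclusive word span [start, end]."""
    span = words[start:end + 1]
    text = " ".join(w.text for w in span)
    # The enclosing box is the union of the boxes of the words in the span.
    box = (
        min(w.bbox[0] for w in span),
        min(w.bbox[1] for w in span),
        max(w.bbox[2] for w in span),
        max(w.bbox[3] for w in span),
    )
    return text, box


# Invented sample: two rows of a small financial table.
words = [
    Word("Turnover", (10, 10, 80, 22)),
    Word("1,200", (200, 10, 240, 22)),
    Word("Cost", (10, 30, 40, 42)),
    Word("of", (44, 30, 56, 42)),
    Word("sales", (60, 30, 95, 42)),
    Word("(450)", (200, 30, 240, 42)),
]

# Suppose a model answering "What is the cost of sales?" predicts
# start = end = 5; the answer is read straight from the labelled words.
answer, box = extract_answer(words, 5, 5)
print(answer, box)
```

With ground-truth words and boxes, the training target for a question is just a pair of indices, so no OCR step is needed to build it.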
Community
Paper: https://arxiv.org/abs/2412.04262
Dataset: https://huggingface.co/datasets/ethanbradley/synfintabs
Model: https://huggingface.co/ethanbradley/fintabqa
Dataset generation code: https://github.com/ethanbradley/synfintabgen
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- "What is the value of templates?" Rethinking Document Information Extraction Datasets for LLMs (2024)
- Synthetic Data Generation with Large Language Models for Personalized Community Question Answering (2024)
- RedStone: Curating General, Code, Math, and QA Data for Large Language Models (2024)
- Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation (2024)
- Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges (2024)
- A Comparative Study of PDF Parsing Tools Across Diverse Document Categories (2024)
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding (2024)