---
library_name: transformers
tags:
- donut
license: mit
language:
- es
base_model:
- naver-clova-ix/donut-base-finetuned-cord-v2
---
|
# Model Card: Donut Model for Ticket Parsing
|
|
|
## Model Description

This is a fine-tuned version of the [Donut](https://huggingface.co/naver-clova-ix/donut-base-finetuned-cord-v2) architecture, tailored for parsing retail receipts in Spanish. Donut is a transformer-based model for document understanding that performs OCR-free parsing, processing images directly into structured JSON outputs. This implementation was fine-tuned on a custom dataset of artificial and real receipts.
|
|
|
### Use Case

This model is intended for parsing receipts into structured data, extracting information such as item names, quantities, prices, taxes, and total amounts directly from image inputs.
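For illustration, the parsed result is a nested, CORD-style structure inherited from the base model. The field names and values below are hypothetical; the exact keys depend on the annotation schema used during fine-tuning:

```python
# Hypothetical parsed output for a small Spanish receipt. The CORD-style
# keys ("menu", "nm", "cnt", "price", ...) follow the base model's schema
# and are assumptions; the fine-tuned schema may differ.
example_output = {
    "menu": [
        {"nm": "Leche entera 1L", "cnt": "2", "price": "1.98"},
        {"nm": "Pan de molde", "cnt": "1", "price": "1.25"},
    ],
    "total": {"total_price": "3.23"},
}
```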
|
|
|
## Dataset

The model was trained on a mixture of synthetic and real-world receipts:

- **Artificial Receipts**: Generated using a custom tool inspired by SynthDoG and built with OpenCV. The tool simulates various real-world conditions (e.g., Gaussian noise, wrinkles, luminance variations) to improve the robustness of the model; a minimal sketch of this kind of augmentation follows the list.
- **Real Receipts**: A manually parsed dataset of 704 receipts, including a validation set of 200 receipts.
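As a rough idea of the degradations such a tool applies, the sketch below adds Gaussian noise and a gamma-based luminance shift to a rendered receipt with OpenCV. It is an illustrative sketch, not the actual generation tool, and the file and function names are assumptions:

```python
import cv2
import numpy as np

def degrade_receipt(image: np.ndarray) -> np.ndarray:
    """Apply illustrative real-world degradations to a receipt image."""
    # Add Gaussian pixel noise
    noise = np.random.normal(0, 12, image.shape).astype(np.float32)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Simulate a luminance variation with a random gamma correction
    gamma = np.random.uniform(0.6, 1.4)
    table = (np.linspace(0, 1, 256) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(noisy, table)

degraded = degrade_receipt(cv2.imread("synthetic_receipt.png"))
cv2.imwrite("synthetic_receipt_degraded.png", degraded)
```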
|
|
|
### Data Creation Process

The artificial receipts were generated from a combination of background images, fonts, and custom templates to mimic real-world conditions, ensuring the model can handle distortions such as noise, wrinkles, and lighting changes. The real receipts were annotated manually with a custom tool built on the Marimo app, which allowed structured annotation of receipt elements.
|
|
|
## Training Details

- **Hardware**: The model was trained on Google Colab Pro.
- **Training Steps**: Training ran in three main stages of 10 epochs each, for 30 epochs in total.
- **Evaluation Metrics**: Model quality was measured with a combination of the Levenshtein edit distance for string similarity and nTED (normalized Tree Edit Distance) for accuracy on tree-structured outputs; a minimal sketch of the string metric follows this list.
- **Performance**: The model improved significantly when trained on a mix of artificial and real receipts, reaching a validation accuracy of 0.98 and a test accuracy of 0.70.
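For reference, the string-similarity score can be computed as one minus the Levenshtein distance normalized by the longer sequence. The sketch below is a minimal pure-Python version of that metric; nTED, which compares the predicted and ground-truth JSON trees, is omitted for brevity:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(pred: str, target: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    if not pred and not target:
        return 1.0
    return 1 - levenshtein(pred, target) / max(len(pred), len(target))
```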
|
|
|
### Results

The model was evaluated on both the validation and test datasets, achieving the following results:

- **Validation Accuracy**: 98.37% (final fine-tuned model)
- **Test Accuracy**: 69.63% (final fine-tuned model)
|
|
|
## Limitations

- **Synthetic Data**: Although artificial receipts helped improve performance, the model may still struggle with unseen or very complex receipt formats that were not part of the training dataset.
- **Real-world Deployment**: Further fine-tuning may be necessary to adapt the model to new types of receipts or to languages other than Spanish.
|
|
|
## Ethical Considerations

- **Privacy**: Take care when applying this model to personal or sensitive financial data, and ensure compliance with local privacy laws and regulations.
- **Bias**: The model was trained on a limited set of receipts, which could bias it toward certain types of stores or receipt formats.
|
|
|
## How to Use

This model is available on Hugging Face and can be used as follows:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import re

# Load model and processor
print("Loading Donut model...")
processor = DonutProcessor.from_pretrained("pandafm/donut-es")
model = VisionEncoderDecoderModel.from_pretrained("pandafm/donut-es")

# Use the GPU when available; on CPU, run the encoder in bfloat16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    model.to(device)
else:
    model.encoder.to(torch.bfloat16)
print("Donut model loaded.")

# Open an image of a receipt
image = Image.open("path_to_receipt_image.jpg").convert("RGB")

# Preprocess the image into pixel values
pixel_values = processor(image, return_tensors="pt").pixel_values
if device.type == "cuda":
    pixel_values = pixel_values.to(device)
else:
    pixel_values = pixel_values.to(torch.bfloat16)

# Prime the decoder with the task prompt
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

# Autoregressively generate the output sequence
result = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode the sequence, strip special tokens, and convert it to JSON
seq = processor.batch_decode(result.sequences)[0]
seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
output = processor.token2json(seq)
print(output)
```
|
|
|
## Acknowledgements

This model was fine-tuned as part of a research project for a Bachelor's degree, leveraging the Donut architecture and integrating tools like OpenCV for data generation. The final dataset included both synthetic and real-world receipts to improve robustness in parsing.
|
|
|
## Citation

```bibtex
@thesis{pandafm2024DonutES,
  author   = {David Florez Mazuera},
  title    = {Ticket Parser},
  school   = {Universidad de Murcia},
  year     = {2024},
  address  = {Murcia, España},
  month    = {June},
  type     = {Bachelor's thesis},
  note     = {Supervisor: Ginés García Mateos},
  url      = {},
  keywords = {donut, transformers, fine-tune},
}
```