---
library_name: transformers
tags:
  - donut
license: mit
language:
  - es
base_model:
  - naver-clova-ix/donut-base-finetuned-cord-v2
---

# Model Card: Donut Model for Ticket Parsing

## Model Description

This is a fine-tuned version of the [Donut](https://huggingface.co/naver-clova-ix/donut-base-finetuned-cord-v2) architecture, tailored for parsing retail receipts. Donut is a transformer-based model for document understanding that performs OCR-free parsing, converting document images directly into structured JSON output. This implementation was fine-tuned on a custom dataset of artificial and real receipts.

### Use Case

This model is intended for parsing receipts into structured data, extracting information such as item names, quantities, prices, taxes, and total amounts directly from image inputs.

## Dataset

The model was trained on a mixture of synthetic and real-world receipts:

- **Artificial Receipts**: Generated with a custom tool inspired by SynthDoG and built with OpenCV. The tool simulates real-world conditions (e.g., Gaussian noise, wrinkles, luminance variations) to improve the robustness of the model.
- **Real Receipts**: A manually parsed dataset of 704 receipts, including a validation set of 200 receipts.

### Data Creation Process

The artificial receipts were generated from a combination of background images, fonts, and custom templates to mimic real-world conditions, so that the model learns to handle distortions such as noise, wrinkles, and lighting changes (an illustrative sketch of this kind of augmentation appears below, under *Sketch: Synthetic Receipt Augmentation*). The real receipts were annotated manually with a custom tool based on the Marimo app, which allowed structured annotation of receipt elements.

## Training Details

- **Hardware**: The model was trained on Google Colab Pro.
- **Training Steps**: The model was trained in three main stages of 10 epochs each, for 30 epochs in total.
- **Evaluation Metrics**: Model quality was measured with the Levenshtein edit distance for string similarity and with nTED (normalized Tree Edit Distance) for accuracy on tree-structured outputs (see *Sketch: Edit-Distance Accuracy* below).
- **Performance**: Training on a mix of artificial and real receipts brought significant improvements, reaching a validation accuracy of 0.98 and a test accuracy of 0.70.

### Results

The final fine-tuned model achieved the following results on the validation and test datasets:

- **Validation Accuracy**: 98.37%
- **Test Accuracy**: 69.63%

## Limitations

- **Synthetic Data**: Although artificial receipts improved performance, the model may still struggle with unseen or very complex receipt formats that were not part of the training dataset.
- **Real-world Deployment**: Further fine-tuning may be necessary to adapt the model to new types of receipts or to other languages.

## Ethical Considerations

- **Privacy**: Take care when applying this model to personal or sensitive financial data, and ensure compliance with local privacy laws and regulations.
- **Bias**: The model was trained on a limited set of receipts, which could bias it toward certain types of stores or receipt formats.
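## Sketch: Synthetic Receipt Augmentation

For illustration, the kinds of degradations described in the *Data Creation Process* can be approximated with OpenCV. This is a minimal sketch, not the actual generation tool from the thesis; the function names, parameter values, and file paths are illustrative assumptions.

```python
import cv2
import numpy as np

def add_gaussian_noise(img: np.ndarray, sigma: float = 10.0) -> np.ndarray:
    """Add zero-mean Gaussian noise, simulating sensor grain."""
    noise = np.random.normal(0.0, sigma, img.shape).astype(np.float32)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def vary_luminance(img: np.ndarray, gain: float = 1.2, bias: float = -15.0) -> np.ndarray:
    """Apply a global brightness/contrast shift, simulating uneven lighting."""
    return cv2.convertScaleAbs(img, alpha=gain, beta=bias)

def add_wrinkles(img: np.ndarray, amplitude: float = 4.0, wavelength: float = 60.0) -> np.ndarray:
    """Warp the image with a sinusoidal displacement field, a crude paper-wrinkle effect."""
    h, w = img.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xs + amplitude * np.sin(2 * np.pi * ys / wavelength)).astype(np.float32)
    map_y = (ys + amplitude * np.sin(2 * np.pi * xs / wavelength)).astype(np.float32)
    return cv2.remap(img, map_x, map_y, interpolation=cv2.INTER_LINEAR,
                     borderMode=cv2.BORDER_REPLICATE)

# "clean_receipt.png" stands in for a freshly rendered synthetic receipt
receipt = cv2.imread("clean_receipt.png")
augmented = add_wrinkles(vary_luminance(add_gaussian_noise(receipt)))
cv2.imwrite("augmented_receipt.png", augmented)
```

Chaining such transforms with randomized parameters per sample is a common way to obtain varied training images from a single clean render.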
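## Sketch: Edit-Distance Accuracy

The accuracy figures in *Results* belong to the normalized edit-distance family of metrics that Donut-style models typically report: Levenshtein distance for string similarity, generalized to JSON trees by nTED. Below is a minimal sketch of the string-level variant, assuming accuracy is computed as one minus the length-normalized edit distance; the exact metric used in the thesis may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance with unit-cost insert/delete/substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def edit_accuracy(pred: str, gold: str) -> float:
    """1 - normalized edit distance; 1.0 means an exact match."""
    if not pred and not gold:
        return 1.0
    return 1.0 - levenshtein(pred, gold) / max(len(pred), len(gold))

# One substitution over nine characters -> accuracy of about 0.889
print(edit_accuracy("cafe 1.50", "café 1.50"))
```

nTED applies the same idea at the tree level, comparing the predicted JSON structure against the ground truth node by node.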
## How to Use

This model is available on Hugging Face and can be used as follows:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import json
import re
import torch

# Load model and processor
print("Loading Donut model...")
processor = DonutProcessor.from_pretrained("pandafm/donut-es")
model = VisionEncoderDecoderModel.from_pretrained("pandafm/donut-es")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
if device.type == "cpu":
    # On CPU, run the encoder in bfloat16 to reduce memory use
    model.encoder.to(torch.bfloat16)
print("Donut model loaded.")

# Open an image of a receipt
image = Image.open("path_to_receipt_image.jpg").convert("RGB")

# Preprocess the image into pixel values
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
if device.type == "cpu":
    pixel_values = pixel_values.to(torch.bfloat16)

# Task start token (inherited from the CORD-v2 base model)
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

# Autoregressively generate the output sequence
result = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Strip special tokens and convert the sequence to JSON
seq = processor.batch_decode(result.sequences)[0]
seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove the first task start token
receipt = processor.token2json(seq)
print(json.dumps(receipt, indent=2, ensure_ascii=False))
```

`token2json` returns a nested Python dictionary mirroring the receipt structure (items, quantities, prices, totals).

## Acknowledgements

This model was fine-tuned as part of a research project for a Bachelor's degree, leveraging the Donut architecture and integrating tools such as OpenCV for data generation. The final dataset combined synthetic and real-world receipts to improve parsing robustness.

## Citation

```bibtex
@thesis{pandafm2024DonutES,
  author   = {David Florez Mazuera},
  title    = {Ticket Parser},
  school   = {Universidad de Murcia},
  year     = {2024},
  address  = {Murcia, España},
  month    = {June},
  type     = {Bachelor's thesis},
  note     = {Advisor: Gines García Mateos},
  url      = {},
  keywords = {donut, transformers, fine-tune},
}
```