---
library_name: transformers
tags:
- donut
license: mit
language:
- es
base_model:
- naver-clova-ix/donut-base-finetuned-cord-v2
---
|
# Model Card: Donut Model for Ticket Parsing
|
|
|
## Model Description

This is a fine-tuned version of the [Donut](https://huggingface.co/naver-clova-ix/donut-base-finetuned-cord-v2) architecture, tailored for parsing retail receipts in Spanish. Donut is a transformer-based model for document understanding that performs OCR-free parsing, processing images directly into structured JSON outputs. This implementation was fine-tuned on a custom dataset of artificial and real receipts.
|
|
|
### Use Case

This model is intended for parsing receipts into structured data, extracting information such as item names, quantities, prices, taxes, and total amounts directly from image inputs.
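For illustration, the parsed result is a nested, CORD-style structure inherited from the base model. The field names and values below are hypothetical; the exact keys depend on the annotation schema used during fine-tuning:

```python
# Hypothetical parsed output for a small Spanish receipt. The CORD-style
# keys ("menu", "nm", "cnt", "price", ...) follow the base model's schema
# and are assumptions; the fine-tuned schema may differ.
example_output = {
    "menu": [
        {"nm": "Leche entera 1L", "cnt": "2", "price": "1.98"},
        {"nm": "Pan de molde", "cnt": "1", "price": "1.25"},
    ],
    "total": {"total_price": "3.23"},
}
```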
|
|
|
## Dataset

The model was trained on a mixture of synthetic and real-world receipts:

- **Artificial Receipts**: Generated using a custom tool inspired by SynthDoG and built with OpenCV. The tool simulates various real-world conditions (e.g., Gaussian noise, wrinkles, luminance variations) to improve the robustness of the model; a minimal sketch of this kind of augmentation follows the list.
- **Real Receipts**: A manually parsed dataset of 704 receipts, including a validation set of 200 receipts.
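As a rough idea of the degradations such a tool applies, the sketch below adds Gaussian noise and a gamma-based luminance shift to a rendered receipt with OpenCV. It is an illustrative sketch, not the actual generation tool, and the file and function names are assumptions:

```python
import cv2
import numpy as np

def degrade_receipt(image: np.ndarray) -> np.ndarray:
    """Apply illustrative real-world degradations to a receipt image."""
    # Add Gaussian pixel noise
    noise = np.random.normal(0, 12, image.shape).astype(np.float32)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    # Simulate a luminance variation with a random gamma correction
    gamma = np.random.uniform(0.6, 1.4)
    table = (np.linspace(0, 1, 256) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(noisy, table)

degraded = degrade_receipt(cv2.imread("synthetic_receipt.png"))
cv2.imwrite("synthetic_receipt_degraded.png", degraded)
```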
|
|
|
### Data Creation Process

The artificial receipts were generated from a combination of background images, fonts, and custom templates to mimic real-world conditions, ensuring the model can handle distortions such as noise, wrinkles, and lighting changes. The real receipts were annotated manually with a custom tool built on the Marimo app, which allowed structured annotation of receipt elements.
|
|
|
## Training Details

- **Hardware**: The model was trained on Google Colab Pro.
- **Training Steps**: Training ran in three main stages of 10 epochs each, for 30 epochs in total.
- **Evaluation Metrics**: Model quality was measured with a combination of the Levenshtein edit distance for string similarity and nTED (normalized Tree Edit Distance) for accuracy on tree-structured outputs; a minimal sketch of the string metric follows this list.
- **Performance**: The model improved significantly when trained on a mix of artificial and real receipts, reaching a validation accuracy of 0.98 and a test accuracy of 0.70.
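For reference, the string-similarity score can be computed as one minus the Levenshtein distance normalized by the longer sequence. The sketch below is a minimal pure-Python version of that metric; nTED, which compares the predicted and ground-truth JSON trees, is omitted for brevity:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(pred: str, target: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    if not pred and not target:
        return 1.0
    return 1 - levenshtein(pred, target) / max(len(pred), len(target))
```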
|
|
|
### Results

The model was evaluated on both the validation and test datasets, achieving the following results:

- **Validation Accuracy**: 98.37% (final fine-tuned model)
- **Test Accuracy**: 69.63% (final fine-tuned model)
|
|
|
## Limitations

- **Synthetic Data**: Although artificial receipts helped improve performance, the model may still struggle with unseen or very complex receipt formats that were not part of the training dataset.
- **Real-world Deployment**: Further fine-tuning may be necessary to adapt the model to new types of receipts or to languages other than Spanish.
|
|
|
## Ethical Considerations

- **Privacy**: Take care when applying this model to personal or sensitive financial data, and ensure compliance with local privacy laws and regulations.
- **Bias**: The model was trained on a limited set of receipts, which could bias it toward certain types of stores or receipt formats.
|
|
|
## How to Use

This model is available on Hugging Face and can be used as follows:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel
from PIL import Image
import torch
import re

# Load model and processor
print("Loading Donut model...")
processor = DonutProcessor.from_pretrained("pandafm/donut-es")
model = VisionEncoderDecoderModel.from_pretrained("pandafm/donut-es")

# Use the GPU when available; on CPU, run the encoder in bfloat16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    model.to(device)
else:
    model.encoder.to(torch.bfloat16)
print("Donut model loaded.")

# Open an image of a receipt
image = Image.open("path_to_receipt_image.jpg").convert("RGB")

# Preprocess the image into pixel values
pixel_values = processor(image, return_tensors="pt").pixel_values
if device.type == "cuda":
    pixel_values = pixel_values.to(device)
else:
    pixel_values = pixel_values.to(torch.bfloat16)

# Prime the decoder with the task prompt
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

# Autoregressively generate the output sequence
result = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
)

# Decode the sequence, strip special tokens, and convert it to JSON
seq = processor.batch_decode(result.sequences)[0]
seq = seq.replace(processor.tokenizer.eos_token, "").replace(processor.tokenizer.pad_token, "")
seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
output = processor.token2json(seq)
print(output)
```
|
|
|
## Acknowledgements

This model was fine-tuned as part of a research project for a Bachelor's degree, leveraging the Donut architecture and integrating tools like OpenCV for data generation. The final dataset included both synthetic and real-world receipts to improve robustness in parsing.
|
|
|
## Citation

```bibtex
@thesis{pandafm2024DonutES,
  author   = {David Florez Mazuera},
  title    = {Ticket Parser},
  school   = {Universidad de Murcia},
  year     = {2024},
  address  = {Murcia, España},
  month    = {June},
  type     = {Bachelor's thesis},
  note     = {Supervisor: Ginés García Mateos},
  url      = {},
  keywords = {donut, transformers, fine-tune},
}
```