---
language: ru
tags:
- historical-text
- russian
- ocr
- trocr
license: mit
metrics:
- cer
- wer
---

# Historical Russian TrOCR Model for Civil Script (ru-trocr-1700s)

## Model Description

This model is specifically trained to recognize Russian Civil Script (гражданский шрифт) from the 18th century. It handles the following character sets:

- Historical letters: ѣ, і, ѳ, ѵ, ъ
- Civil script variations of standard Cyrillic characters
- Both uppercase and lowercase variants
- Special typographic features of 18th-century printing

## Model Performance Metrics

- Character Error Rate (CER): 1.69%
- Word Error Rate (WER): 5.75%
- Sequence Accuracy: 80.21%
- Training Loss: 0.0403
- Evaluation Loss: 0.0351

## Training Details

- Base Model: TrOCR
- Training Duration: ~25.5 hours
- Epochs: 3
- Steps: 1227
- Training Samples per Second: 0.428
- Special Focus: Civil script character recognition including historical letters and their variants
- Training Data: 18th-century Russian books from the National Library of Russia

## Historical Context

The model is trained on texts printed in Civil Script (гражданский шрифт), introduced by Peter the Great's reform in 1708. This script represents a significant transition in Russian typography from Church Slavonic to a more modernized form of writing. The Civil Script remained the standard for Russian publishing houses and typographers until the 1830s, making it the primary typeface for Russian printed books throughout the 18th and early 19th centuries.

![Russian Civil Script Alphabet](XVIII_century_Russian_font.png)

## Limitations and Recommendations

- Optimized for line-level recognition of historical Russian texts in Civil Script
- Best performance on well-segmented lines
- May require pre-processing for damaged or low-quality images
- Specifically tuned for 18th-century Russian printing conventions

## Usage Example

```python
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("taiga75/ru-trocr-1700s")
model = VisionEncoderDecoderModel.from_pretrained("taiga75/ru-trocr-1700s")

# Process image
image = Image.open("path_to_image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Generate text
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{maria_levchenko_2025,
    author = {{Maria Levchenko}},
    title = {ru-trocr-1700s (Revision 8d7a9f4)},
    year = 2025,
    url = {https://huggingface.co/taiga75/ru-trocr-1700s},
    doi = {10.57967/hf/3942},
    publisher = {Hugging Face}
}
```
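
## Reproducing the Metrics (Sketch)

The CER and WER figures under Model Performance Metrics are conventionally defined as the edit distance between the recognized text and the ground truth, divided by the ground-truth length in characters or words. The sketch below illustrates that convention in plain Python so you can score the model on your own transcribed lines. It is not the evaluation script used for this model: the function names are hypothetical, the sample lines are invented for illustration, and treating "Sequence Accuracy" as the share of exactly matching lines is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate over a corpus of lines (references must be non-empty)."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return errors / sum(len(r) for r in references)


def wer(references, hypotheses):
    """Word error rate over a corpus of lines, splitting on whitespace."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    return errors / sum(len(r.split()) for r in references)


def sequence_accuracy(references, hypotheses):
    """Share of lines recognized with no errors (assumed definition)."""
    exact = sum(r == h for r, h in zip(references, hypotheses))
    return exact / len(references)


# Invented sample lines: the second hypothesis misreads one "ѣ" as "е".
references = ["въ семъ городѣ", "по указу его величества"]
hypotheses = ["въ семъ городе", "по указу его величества"]

print(f"CER: {cer(references, hypotheses):.4f}")
print(f"WER: {wer(references, hypotheses):.4f}")
print(f"Sequence accuracy: {sequence_accuracy(references, hypotheses):.4f}")
```

Summing errors over the whole corpus before dividing weights long lines more heavily than averaging per-line rates; which of the two was used for the reported figures is not stated here, so small differences from the published numbers are to be expected.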