Historical Russian TrOCR Model for Civil Script (ru-trocr-1700s)

Model Description

This model is trained to recognize printed Russian text in Civil Script (гражданский шрифт) from the 18th century. It covers:

  • Historical letters: ѣ, і, ѳ, ѵ, ъ
  • Civil script variations of standard Cyrillic characters
  • Both uppercase and lowercase variants
  • Special typographic features of 18th-century printing
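
Several of these letters have modern or Latin lookalikes (for example Cyrillic і, U+0456, versus Latin i, U+0069), so it can help to check which historical letters actually appear in a recognized line. The following is a minimal sketch using only the Python standard library; the function name `count_historical_letters` and the sample string are illustrative assumptions, not part of the model's API.

```python
from collections import Counter

# Lowercase historical letters listed above, written as code points so that
# lookalikes (e.g. Latin "i") cannot slip in unnoticed.
HISTORICAL_LETTERS = {
    "\u0463": "yat",        # ѣ
    "\u0456": "decimal i",  # і
    "\u0473": "fita",       # ѳ
    "\u0475": "izhitsa",    # ѵ
    "\u044a": "hard sign",  # ъ
}

def count_historical_letters(text: str) -> Counter:
    """Count occurrences of each historical letter, ignoring case."""
    counts = Counter()
    for ch in text.lower():
        name = HISTORICAL_LETTERS.get(ch)
        if name is not None:
            counts[name] += 1
    return counts

# Hypothetical recognized line ("Въ лѣто"), for illustration only.
sample = "\u0412\u044a \u043b\u0463\u0442\u043e"
print(dict(count_historical_letters(sample)))
```

A Latin "i" in the output is not counted, which makes this a quick way to spot lookalike substitutions in post-processing.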

Model Performance Metrics

  • Character Error Rate (CER): 1.69%
  • Word Error Rate (WER): 5.75%
  • Sequence Accuracy: 80.21%
  • Training Loss: 0.0403
  • Evaluation Loss: 0.0351
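
For readers who want to compare these figures with their own evaluation, CER and WER are conventionally defined as the edit distance between prediction and reference divided by the reference length, in characters or words respectively. The sketch below assumes that standard definition and assumes that Sequence Accuracy means the exact-match rate per line; the model card does not state how the reported values were computed, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Character errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

def wer(refs, hyps):
    """Word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)

def sequence_accuracy(refs, hyps):
    """Fraction of lines recognized without any error."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

# Toy data: one substituted character in the first line.
refs = ["abc def", "ghi"]
hyps = ["abc dxf", "ghi"]
print(cer(refs, hyps), wer(refs, hyps), sequence_accuracy(refs, hyps))
```

Note that a single wrong character makes a whole word and a whole line count as wrong, which is why WER is higher than CER and Sequence Accuracy is well below 100% even at a low CER.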

Training Details

  • Base Model: TrOCR
  • Training Duration: ~25.5 hours
  • Epochs: 3
  • Steps: 1227
  • Training Samples per Second: 0.428
  • Special Focus: Civil script character recognition including historical letters and their variants
  • Training Data: 18th-century Russian books from the National Library of Russia
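
The figures above allow a rough cross-check of the training scale. The following back-of-the-envelope sketch assumes that the reported samples-per-second rate is an average over the whole ~25.5-hour run, that each of the 3 epochs covers the same data, and that "Steps" counts optimizer steps. The derived values are estimates, not numbers reported by the author.

```python
# Values reported in the Training Details list.
duration_hours = 25.5
samples_per_second = 0.428
epochs = 3
steps = 1227

# Derived estimates (assumptions: constant throughput, equal-sized epochs).
total_samples = samples_per_second * duration_hours * 3600
samples_per_epoch = total_samples / epochs
samples_per_step = total_samples / steps  # effective batch size

print(f"samples processed: ~{total_samples:,.0f}")
print(f"samples per epoch: ~{samples_per_epoch:,.0f}")
print(f"samples per step:  ~{samples_per_step:.1f}")
```

Under these assumptions the run works out to roughly 39,000 samples processed, about 13,000 samples per epoch, and an effective batch size near 32.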

Historical Context

The model is trained on texts printed in Civil Script (гражданский шрифт), introduced by Peter the Great's reform in 1708. The script marked a significant transition in Russian typography from the Church Slavonic typeface to a more modern letterform. Civil Script remained the standard for Russian publishing houses and typographers until the 1830s, making it the primary typeface for Russian printed books throughout the 18th and early 19th centuries.

[Figure: Russian Civil Script alphabet]

Limitations and Recommendations

  • Optimized for line-level recognition of historical Russian texts in Civil Script
  • Best performance on well-segmented lines
  • May require pre-processing for damaged or low-quality images
  • Specifically tuned for 18th-century Russian printing conventions
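
Since the model expects one text line per image, page scans need to be split into lines first. One common approach is a horizontal projection profile: count the ink pixels in each row and cut wherever the count drops below a threshold. The sketch below works on a binarized page given as a list of rows of 0/1 values; the function name, the `min_ink` threshold, and the toy page are illustrative assumptions, and real scans typically also need deskewing or a dedicated layout-analysis tool.

```python
def split_lines(page, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary page.

    `page` is a list of rows; each row is a list of 0 (background) or
    1 (ink). `bottom` is exclusive. A row belongs to a line when it
    contains at least `min_ink` ink pixels.
    """
    lines = []
    start = None
    for y, row in enumerate(page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                 # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))  # the current line ends
            start = None
    if start is not None:             # line touching the bottom edge
        lines.append((start, len(page)))
    return lines

# Toy 6-row page with two "lines" of ink separated by a blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]
print(split_lines(page))
```

Each (top, bottom) range can then be cropped from the original scan, for example with PIL's `Image.crop`, and passed to the model one line at a time.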

Usage Example

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("taiga75/ru-trocr-1700s")
model = VisionEncoderDecoderModel.from_pretrained("taiga75/ru-trocr-1700s")

# Load a single text-line image and convert it to model input
image = Image.open("path_to_image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Generate token IDs and decode them into text
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)

Citation

If you use this model in your research, please cite:

@misc{maria_levchenko_2025,
    author       = {{Maria Levchenko}},
    title        = {ru-trocr-1700s (Revision 8d7a9f4)},
    year         = 2025,
    url          = {https://huggingface.co/taiga75/ru-trocr-1700s},
    doi          = {10.57967/hf/3942},
    publisher    = {Hugging Face}
}

Model size: 334M params (Safetensors, tensor type F32)