Historical Russian TrOCR Model for Civil Script (ru-trocr-1700s)

Model Description

This model is trained specifically to recognize printed Russian Civil Script (гражданский шрифт) from the 18th century. It covers:

Historical letters: ѣ, і, ѳ, ѵ, ъ
Civil script variations of standard Cyrillic characters
Both uppercase and lowercase variants
Special typographic features of 18th-century printing

Model Performance Metrics

Character Error Rate (CER): 1.69%
Word Error Rate (WER): 5.75%
Sequence Accuracy: 80.21%
Training Loss: 0.0403
Evaluation Loss: 0.0351
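
The card does not say how these metrics were computed. As a minimal sketch, assuming the standard definitions (Levenshtein edit distance divided by reference length for CER and WER, exact line match for sequence accuracy), they could be reproduced on your own ground truth as follows. The function names are illustrative and not part of the model's code.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(refs: List[str], hyps: List[str]) -> float:
    """Character error rate: total character edits / total reference characters."""
    total = sum(len(r) for r in refs)
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / total if total else 0.0


def wer(refs: List[str], hyps: List[str]) -> float:
    """Word error rate: total word edits / total reference words."""
    total = sum(len(r.split()) for r in refs)
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / total if total else 0.0


def sequence_accuracy(refs: List[str], hyps: List[str]) -> float:
    """Fraction of lines transcribed with no error at all."""
    if not refs:
        return 0.0
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# One missed ѣ costs 1 of 7 characters but 1 of 2 words.
refs = ["въ лѣто"]
hyps = ["въ лето"]
print(cer(refs, hyps), wer(refs, hyps), sequence_accuracy(refs, hyps))
```

Because a single wrong character invalidates a whole word and a whole line, WER and sequence accuracy are expected to look considerably worse than CER for the same output.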

Training Details

Base Model: TrOCR
Training Duration: ~25.5 hours
Epochs: 3
Steps: 1227
Training Samples per Second: 0.428
Special Focus: Civil script character recognition including historical letters and their variants
Training Data: 18th-century Russian books from the National Library of Russia
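
The figures above can be cross-checked with some rough arithmetic. This is a back-of-envelope sketch, not a statement from the author: it assumes that "Training Samples per Second" counts samples over all epochs (as the Hugging Face Trainer reports it) and that the duration is the pure training runtime. The function name and the derived values are illustrative estimates only.

```python
def estimate_training_volume(hours: float, samples_per_second: float,
                             epochs: int, steps: int) -> dict:
    """Derive approximate sample counts from the reported training figures."""
    total_samples = hours * 3600 * samples_per_second  # samples seen over all epochs
    return {
        "total_samples_seen": total_samples,
        "samples_per_epoch": total_samples / epochs,    # rough training-set size
        "samples_per_step": total_samples / steps,      # rough effective batch size
    }


# Values taken from the list above; the duration is itself approximate (~25.5 h).
estimate = estimate_training_volume(hours=25.5, samples_per_second=0.428,
                                    epochs=3, steps=1227)
for name, value in estimate.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the numbers point to roughly 13,000 training lines and an effective batch size of about 32, but neither value is confirmed by the card.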

Historical Context

The model is trained on texts printed in Civil Script (гражданский шрифт), the typeface introduced by Peter the Great's reform in 1708. The reform marked a significant transition in Russian typography, replacing Church Slavonic letterforms with a simplified, modernized design for secular printing. Civil Script remained the standard for Russian publishing houses and typographers until the 1830s, making it the primary typeface for Russian printed books throughout the 18th and early 19th centuries.

Limitations and Recommendations

Optimized for line-level recognition of historical Russian texts in Civil Script
Best performance on well-segmented lines
May require pre-processing for damaged or low-quality images
Specifically tuned for 18th-century Russian printing conventions
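
Since the model expects well-segmented single lines, a page scan has to be split before recognition. The following is a minimal sketch of one common approach, a horizontal projection profile with Pillow's autocontrast as a light clean-up step. It is not the author's pipeline. The function names, the threshold of 128, and the padding are assumptions chosen for illustration, and skewed, multi-column, or heavily damaged pages need a proper layout-analysis tool instead.

```python
from typing import List, Tuple


def find_line_bands(row_ink: List[int], min_ink: int = 1,
                    min_height: int = 3) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges where each row has at least min_ink dark pixels."""
    bands = []
    start = None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                       # a text band begins
        elif ink < min_ink and start is not None:
            if y - start >= min_height:     # drop specks thinner than min_height
                bands.append((start, y))
            start = None
    if start is not None and len(row_ink) - start >= min_height:
        bands.append((start, len(row_ink)))  # band that runs to the bottom edge
    return bands


def split_page_into_lines(path: str, threshold: int = 128, pad: int = 4) -> list:
    """Crop a page scan into RGB line images using the row ink profile."""
    from PIL import Image, ImageOps  # Pillow, as in the usage example

    gray = ImageOps.autocontrast(Image.open(path).convert("L"))
    width, height = gray.size
    pixels = gray.load()
    # Count dark pixels per row (simple but slow; fine for a sketch).
    row_ink = [sum(1 for x in range(width) if pixels[x, y] < threshold)
               for y in range(height)]
    lines = []
    for top, bottom in find_line_bands(row_ink):
        box = (0, max(0, top - pad), width, min(height, bottom + pad))
        lines.append(gray.crop(box).convert("RGB"))
    return lines
```

Each returned image can be passed to the processor in place of a manually cropped line.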

Usage Example

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image

processor = TrOCRProcessor.from_pretrained("taiga75/ru-trocr-1700s")
model = VisionEncoderDecoderModel.from_pretrained("taiga75/ru-trocr-1700s")

# Load one cropped text-line image and convert it to model input
image = Image.open("path_to_image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Generate token IDs and decode them into text
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
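
For whole pages it is usually more convenient to recognize several line images per call. A minimal sketch, assuming the processor and model are loaded as above: `recognize_lines` and its `batch_size` default are illustrative names and values, not part of the model's API, and the generation settings in the usage comment are untuned examples.

```python
from typing import List


def recognize_lines(images: list, processor, model,
                    batch_size: int = 8, **generate_kwargs) -> List[str]:
    """Run the model over a list of RGB line images in small batches."""
    texts: List[str] = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # The processor resizes and normalizes every image in the batch.
        pixel_values = processor(images=batch, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values, **generate_kwargs)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts


# Example call (requires the model and processor loaded as shown above):
# line_images = [Image.open(p).convert("RGB") for p in ["line1.png", "line2.png"]]
# for text in recognize_lines(line_images, processor, model,
#                             num_beams=4, max_new_tokens=128):
#     print(text)
```

Smaller batches reduce memory use at the cost of throughput, so the batch size is worth adjusting to the available hardware.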

Citation

If you use this model in your research, please cite:

@misc{maria_levchenko_2025,
    author       = {{Maria Levchenko}},
    title        = {ru-trocr-1700s (Revision 8d7a9f4)},
    year         = 2025,
    url          = {https://huggingface.co/taiga75/ru-trocr-1700s},
    doi          = {10.57967/hf/3942},
    publisher    = {Hugging Face}
}