Florence 2 Medieval Zone Object Detection

This is Microsoft's Florence 2 model, trained for 10 epochs on the CATMuS Medieval Segmentation dataset with a learning rate of 1e-6. This model would not be possible without the numerous annotators behind the various datasets available on HTR-United (see the dataset for details). Special thanks to Thibault Clérice, who converted the original CATMuS dataset (for HTR) into a segmentation dataset.

Model Details

Labels

The following table lists the labels, marks those used to train this model, gives the count of each label per split (an image can contain several instances), and defines each label with a link to the original documentation.

| Label | Zone | Line | Train Count | Validation Count | Test Count | Definition |
|---|---|---|---|---|---|---|
| DefaultLine | | ✓ | 81702 | 13554 | 12209 | A line of text that is not distinguished by any particular features and is part of the main text flow. |
| InterlinearLine | | ✓ | 2808 | 27 | 2234 | A line of text written between two lines of main text, typically containing glosses, translations, or comments. |
| MainZone | ✓ | | 2314 | 365 | 275 | The main textual zone of a page, usually containing the main body of text. |
| HeadingLine | | ✓ | 1381 | 701 | 135 | A line of text that functions as a heading or title for a section of the main text. |
| MarginTextZone | ✓ | | 916 | 146 | 199 | A text zone in the margin of a page, often containing annotations, commentaries, or other secondary information. |
| DropCapitalZone | ✓ | | 1566 | 102 | 124 | A zone containing a large ornamental initial letter of a paragraph or section, typically extending below the first line of text. |
| NumberingZone | ✓ | | 632 | 102 | 94 | A zone containing page numbers, folio numbers, or other numerical identifiers for the page. |
| TironianSignLine | | | 282 | 0 | 0 | A line containing Tironian notes, an ancient system of shorthand. |
| DropCapitalLine | | | 1175 | 105 | 92 | A line of text that begins with a drop capital. |
| RunningTitleZone | ✓ | | 340 | 91 | 18 | A zone containing a running title, typically located at the top of a page and repeating throughout a section or the entire document. |
| GraphicZone | ✓ | | 300 | 7 | 10 | A zone containing non-textual elements such as images, drawings, or decorative elements. |
| DigitizationArtefactZone | | | 28 | 0 | 0 | A zone containing artefacts from the digitization process, such as color bars or reference marks. |
| QuireMarksZone | ✓ | | 86 | 9 | 8 | A zone containing marks used to indicate the gathering or quire to which a leaf belongs, often found at the bottom of the page. |
| StampZone | ✓ | | 39 | 5 | 4 | A zone containing a stamp, such as a library stamp or ownership mark. |
| DamageZone | ✓ | | 12 | 1 | 0 | A zone indicating an area of the page that has been damaged or is otherwise illegible due to physical deterioration. |
| MusicZone | ✓ | | 179 | 0 | 0 | A zone containing musical notation. |
| MusicLine | | | 167 | 0 | 0 | A line containing musical notation. |
| TitlePageZone | ✓ | | 4 | 1 | 1 | A zone encompassing the entire title page of a book or document. |
| SealZone | ✓ | | 3 | 0 | 0 | A zone containing a seal, typically used for authentication or closure of a document. |
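
One way to read the train counts is as shares of the whole. The sketch below is illustrative only: it copies the train-split counts of the checked Zone labels from the table into a hypothetical dictionary (`zone_train_counts` is not part of the model or dataset API) and prints each label's share of the total.

```python
# Train-split counts copied by hand from the table for the checked Zone labels.
# The dictionary name is hypothetical; it is not part of any library.
zone_train_counts = {
    "MainZone": 2314, "MarginTextZone": 916, "DropCapitalZone": 1566,
    "NumberingZone": 632, "RunningTitleZone": 340, "GraphicZone": 300,
    "QuireMarksZone": 86, "StampZone": 39, "DamageZone": 12,
    "MusicZone": 179, "TitlePageZone": 4, "SealZone": 3,
}

total = sum(zone_train_counts.values())

# Order labels from most to least frequent and compute each label's share.
ordered = sorted(zone_train_counts.items(), key=lambda kv: kv[1], reverse=True)
shares = {label: count / total for label, count in ordered}

for label, share in shares.items():
    print(f"{label:<18}{zone_train_counts[label]:>6}  {share:6.1%}")
```

The same pattern applies to the validation and test columns if a per-split comparison is needed.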

How to Get Started with the Model

Use the code below to get started with the model. All models are trained with float16.

import os
from unittest.mock import patch

import matplotlib.patches as patches
import matplotlib.pyplot as plt
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.dynamic_module_utils import get_imports

# Mac solution => https://huggingface.co/microsoft/Florence-2-large-ft/discussions/4
def fixed_get_imports(filename: str | os.PathLike) -> list[str]:
    """Work around for https://huggingface.co/microsoft/phi-1_5/discussions/72."""
    imports = get_imports(filename)
    # Drop flash_attn only for the Florence-2 modeling file, and only if it is
    # actually listed, so that list.remove() cannot raise ValueError.
    if str(filename).endswith("/modeling_florence2.py") and "flash_attn" in imports:
        imports.remove("flash_attn")
    return imports


MODEL_ID = "medieval-data/florence2-medieval-bbox-zone-detection"

# Load the model and processor while the import check is patched.
with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

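def process_image(url):
    """Download the image at `url` and run zone detection on it."""
    prompt = "<OD>"

    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early on a bad URL or HTTP error
    # Convert to RGB so grayscale or RGBA scans are handled uniformly.
    image = Image.open(response.raw).convert("RGB")

    inputs = processor(text=prompt, images=image, return_tensors="pt")

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    # Keep special tokens: the post-processing step parses them to recover boxes.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    result = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))
    return result, image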


url = "https://huggingface.co/datasets/CATMuS/medieval-segmentation/resolve/main/data/train/cambridge-corpus-christi-college-ms-111/page-002-of-003.jpg"

result, image = process_image(url)
fig, ax = plt.subplots(1, figsize=(15, 15))
ax.imshow(image)

# Add bounding boxes and labels to the plot.
# Each bbox is [x1, y1, x2, y2]; Rectangle needs the top-left corner plus width and height.
for bbox, label in zip(result['<OD>']['bboxes'], result['<OD>']['labels']):
    x, y, width, height = bbox[0], bbox[1], bbox[2] - bbox[0], bbox[3] - bbox[1]
    rect = patches.Rectangle((x, y), width, height, linewidth=2, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    ax.text(x, y, label, fontsize=12, bbox=dict(facecolor='yellow', alpha=0.5))

# Display the plot
plt.show()
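
For downstream use (cropping zones, exporting to another format), it can be handier to have one record per detection than two parallel lists. The following is a minimal sketch, assuming the result has the `{'<OD>': {'bboxes': [...], 'labels': [...]}}` shape used in the plotting loop above. The helper `od_to_records` and the sample values in `example_result` are hypothetical and hand-written for illustration; they are not model output.

```python
from typing import Dict, Iterable, List, Optional


def od_to_records(result: Dict, keep_labels: Optional[Iterable[str]] = None) -> List[Dict]:
    """Flatten an "<OD>" result into one dict per detection (hypothetical helper).

    Boxes arrive as [x1, y1, x2, y2]; each record stores the top-left corner
    plus width and height. If keep_labels is given, other labels are skipped.
    """
    detections = result["<OD>"]
    wanted = set(keep_labels) if keep_labels is not None else None
    records = []
    for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
        if wanted is not None and label not in wanted:
            continue
        records.append({
            "label": label,
            "x": x1,
            "y": y1,
            "width": x2 - x1,
            "height": y2 - y1,
        })
    return records


# Hand-written stand-in for the output of process_image(); values are illustrative only.
example_result = {
    "<OD>": {
        "bboxes": [[100.0, 150.0, 900.0, 1400.0], [40.0, 20.0, 120.0, 60.0]],
        "labels": ["MainZone", "NumberingZone"],
    }
}

main_zones = od_to_records(example_result, keep_labels={"MainZone"})
print(main_zones)
```

With a real result, `od_to_records(result)` returns every detection, and passing `keep_labels` narrows the list to the zone types of interest.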
