Error when running the sample code

#2
by WYYexperiments - opened

Hi! Thank you for your great work!
When I run the sample code, I get the following error:
Some weights of BeitModel were not initialized from the model checkpoint at cmarkea/dit-base-layout-detection and are newly initialized: ['beit.pooler.layernorm.bias', 'beit.pooler.layernorm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Do you have any idea what causes this? Thank you in advance.

Credit Mutuel Arkea org

Hi WYYexperiments, thank you.
This is a warning, not an error. It arises because, to improve the model's performance, we trained it with the original cross-entropy loss plus an additional cost function aimed at predicting the bounding boxes. The warning does not affect inference performance.

Hi Cyrile, thank you for the reply.
I now get another error:
logits = outputs.logits
^^^^^^^^^^^^^^
AttributeError: 'BeitModelOutputWithPooling' object has no attribute 'logits'
The output has two tensors, 'pooler_output' and 'last_hidden_state', but no 'logits' attribute.
Could you help me with this?

Credit Mutuel Arkea org

Yes, you are right, sorry.
We should not use AutoModel, but BeitForSemanticSegmentation.
I have modified the example accordingly, and it will work (and the warning will also no longer appear)...
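For readers following along, the fix described above can be sketched roughly as follows. This is a minimal sketch, not a copy of the model card: the helper name `segment_page` is hypothetical, and the imports sit inside the function only so that defining it does not require `torch` or `transformers` to be installed. The checkpoint name, `BeitForSemanticSegmentation`, and `post_process_semantic_segmentation` are the ones mentioned in this thread.

```python
def segment_page(image, checkpoint="cmarkea/dit-base-layout-detection"):
    """Return an (H, W) tensor of per-pixel class ids for a PIL image.

    `segment_page` is a hypothetical helper name used for illustration.
    """
    # Lazy imports: defining this helper does not need the heavy dependencies.
    import torch
    from transformers import AutoImageProcessor, BeitForSemanticSegmentation

    img_proc = AutoImageProcessor.from_pretrained(checkpoint)
    # Loading the segmentation class (rather than AutoModel) keeps the
    # segmentation head, so the output object exposes `.logits`.
    model = BeitForSemanticSegmentation.from_pretrained(checkpoint)
    model.eval()

    with torch.inference_mode():
        inputs = img_proc(image, return_tensors="pt")
        outputs = model(**inputs)

    # PIL reports (width, height); target_sizes expects (height, width).
    segmentation = img_proc.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )
    return segmentation[0]


# Example call ("page.png" is a placeholder path):
# from PIL import Image
# seg = segment_page(Image.open("page.png").convert("RGB"))
```

Calling the helper downloads the checkpoint on first use, so it needs network access and both libraries installed.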

I get the following error when I run your code to convert the mask to bounding boxes:
bbx, lab = detect_bboxes(mm.numpy())
^^^^^^^^
ValueError: too many values to unpack (expected 2)

It seems the detected_blocks value does not include label information.
Could you share the code for the visualization shown on the model card?
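The `ValueError` above means the call returned something other than a two-item pair, so `bbx, lab = ...` cannot unpack it. As a rough illustration of a conversion that does return boxes together with labels, here is a NumPy-only sketch. It is not the model card's `detect_bboxes`: the function name `mask_to_labeled_bboxes` and the `ID2LABEL` table are assumptions, with class ids taken to start at 1 in the order of the colour map posted later in this thread. It uses a plain 4-connected flood fill, which is slow on full-page masks but has no dependency beyond NumPy.

```python
import numpy as np

# Assumed id -> label table (ids start at 1; 0 is treated as background).
ID2LABEL = {
    1: "Caption", 2: "Footnote", 3: "Formula", 4: "List-item",
    5: "Page-footer", 6: "Page-header", 7: "Picture",
    8: "Section-header", 9: "Table", 10: "Text", 11: "Title",
}


def mask_to_labeled_bboxes(seg, id2label, min_area=1):
    """Return (bboxes, labels) for a 2-D array of per-pixel class ids.

    Each box is [x_min, y_min, x_max, y_max] (inclusive pixel coordinates)
    and covers one 4-connected region of a single class.
    """
    seg = np.asarray(seg)
    height, width = seg.shape
    visited = np.zeros(seg.shape, dtype=bool)
    bboxes, labels = [], []

    for class_id, label in id2label.items():
        ys, xs = np.nonzero(seg == class_id)
        for y0, x0 in zip(ys.tolist(), xs.tolist()):
            if visited[y0, x0]:
                continue
            # Flood-fill one connected region, tracking its extent and area.
            stack = [(y0, x0)]
            visited[y0, x0] = True
            y_min = y_max = y0
            x_min = x_max = x0
            area = 0
            while stack:
                y, x = stack.pop()
                area += 1
                y_min, y_max = min(y_min, y), max(y_max, y)
                x_min, x_max = min(x_min, x), max(x_max, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not visited[ny, nx]
                            and seg[ny, nx] == class_id):
                        visited[ny, nx] = True
                        stack.append((ny, nx))
            if area >= min_area:
                bboxes.append([x_min, y_min, x_max, y_max])
                labels.append(label)
    return bboxes, labels


# Tiny synthetic mask: two "Text" regions and one "Table" region.
demo = np.zeros((6, 8), dtype=np.int64)
demo[0:2, 0:3] = 10
demo[4:6, 5:8] = 10
demo[3, 0:2] = 9
bboxes, labels = mask_to_labeled_bboxes(demo, ID2LABEL)
```

With a real prediction, the `seg` argument would be the post-processed segmentation map converted to NumPy, as in the `mm.numpy()` call above; raising `min_area` drops tiny spurious regions.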

Credit Mutuel Arkea org
edited Sep 12, 2024

Hi, yes of course, here is an untested and non-debugged code snippet. Feel free to adapt it according to your needs.

from collections import OrderedDict
from PIL import Image

import torch
import matplotlib.pyplot as plt
from einops import rearrange
from torchvision.transforms.functional import pil_to_tensor
from torchvision.utils import draw_segmentation_masks

map_color = OrderedDict(
    [("Caption", "red"),
     ("Footnote", "yellowgreen"),
     ("Formula", "skyblue"),
     ("List-item", "magenta"),
     ("Page-footer", "red"),
     ("Page-header", "darkorange"),
     ("Picture", "gold"),
     ("Section-header", "indigo"),
     ("Table", "sienna"),
     ("Text", "slategray"),
     ("Title", "teal")]
)

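# img (a PIL image), img_proc and output come from the inference example on the model card.
segmentation = img_proc.post_process_semantic_segmentation(output, target_sizes=[img.size[::-1]])
img_tensor = pil_to_tensor(img.convert("RGB"))  # uint8 tensor of shape (3, H, W)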

colors, masks, labels = [], [], []
for ii, (label, color) in enumerate(map_color.items()):
    mask = segmentation[0] == (ii+1)
    if mask.sum() > 0:
        masks.append(mask)
        labels.append(label)
        colors.append(color)

masks = torch.stack(masks)  # bool tensor of shape (num_detected_classes, H, W)
drawn_seg = draw_segmentation_masks(img_tensor, masks, alpha=0.5, colors=colors)
im_seg = Image.fromarray(rearrange(drawn_seg, 'C H W -> H W C').numpy())
plt.imshow(im_seg)
plt.axis("off")
plt.show()

Hi Cyrile, thank you for your kind help. Your model performs impressively.
Could you share more information about the training you did?
I could only find the DiT base checkpoints; I could not find any object detection checkpoints provided by Microsoft online.
It seems we would need to build the model ourselves.

Credit Mutuel Arkea org
Cyrile changed discussion status to closed
