Hi,
I am trying to use ViTMAE (https://huggingface.co/docs/transformers/model_doc/vit_mae). More specifically, I have an image and a mask that specifies the parts of the image I’d like to reconstruct.
As I understand the paper, the model is designed for this task, but in the code and demos I have looked at, the mask is always generated inside the forward method of the MAE model.
- Is my understanding correct or am I missing some essential parts?
- Is there a way to achieve my goal without changing too much of the original code?
Thanks for your help!