I am currently trying to implement image classification using I-JEPA. The paper from Yann LeCun ([2111.06377] Masked Autoencoders Are Scalable Vision Learners) mentioned that it could be applied to image classification, which piqued my interest in exploring it for that purpose. However, I am facing a bit of confusion when it comes to the actual implementation.
From the repository provided on GitHub, I am finding it hard to understand how to modify the model to add a linear classifier, and I am also unclear about how to re-train the model on my own data. Pre-trained models are available on their GitHub as well, but I must admit I am finding it difficult to grasp how to leverage them for my purpose.
Could anyone who has some experience with I-JEPA help me understand the process? Any guidance on how to adapt the model for image classification, and potentially how to use the pre-trained models, would be greatly appreciated.
Looking forward to your suggestions and guidance. Thanks in advance!
The paper you refer to (MAE, or masked autoencoders) is available as ViTMAEForImageClassification in the Transformers library. It adds a linear classifier on top of the base ViTMAEModel. There’s also the ViTMAEForPreTraining class which adds the decoder used for pre-training.
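If you'd rather wire this up yourself, here is a minimal sketch of the same idea using `ViTMAEModel` directly; the `facebook/vit-mae-base` checkpoint and the `num_labels` value are just placeholders for your own setup:

```python
# Minimal sketch: linear classification head on top of the ViT-MAE encoder.
# "facebook/vit-mae-base" and num_labels are placeholders for your own setup.
import torch.nn as nn
from transformers import AutoImageProcessor, ViTMAEModel


class MAEClassifier(nn.Module):
    def __init__(self, checkpoint="facebook/vit-mae-base", num_labels=10):
        super().__init__()
        # mask_ratio=0.0 so the encoder sees all patches instead of the
        # 75% random masking used during pre-training
        self.encoder = ViTMAEModel.from_pretrained(checkpoint, mask_ratio=0.0)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, pixel_values):
        outputs = self.encoder(pixel_values=pixel_values)
        # mean-pool the patch tokens (index 0 is the [CLS] token)
        pooled = outputs.last_hidden_state[:, 1:, :].mean(dim=1)
        return self.classifier(pooled)


processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = MAEClassifier()
# inputs = processor(images=my_pil_image, return_tensors="pt")
# logits = model(inputs["pixel_values"])
```

Note that overriding `mask_ratio` to 0.0 matters here: the pre-training config masks 75% of the patches by default, which you don't want when extracting features for classification.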
How would you go about adding it to the library? I’d love to do that. @nielsr
I have managed to add a classification layer on top of the I-JEPA encoder, but it doesn't seem to be very accurate and takes a while to train for now. I need to play with the hyperparameters a bit, but I still find it strange that the out-of-the-box classification abilities are so limited.
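For reference, here is roughly what my setup looks like (simplified); `load_ijepa_encoder` is just a stand-in for however you restore the target encoder from the official checkpoint, and the dimensions/labels are placeholders:

```python
# Rough sketch of my current setup (simplified).
# load_ijepa_encoder() is a stand-in for loading the target encoder from an
# I-JEPA checkpoint; it should return a ViT mapping
# (B, 3, H, W) -> (B, num_patches, embed_dim) patch embeddings.
import torch
import torch.nn as nn


class IJepaLinearProbe(nn.Module):
    def __init__(self, encoder, embed_dim, num_labels, freeze_encoder=True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            # train only the linear head (a linear probe)
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, images):
        feats = self.encoder(images)   # (B, num_patches, embed_dim)
        pooled = feats.mean(dim=1)     # no [CLS] token, so average the patches
        return self.head(self.norm(pooled))


# encoder = load_ijepa_encoder("path/to/checkpoint.pth.tar")  # placeholder
# model = IJepaLinearProbe(encoder, embed_dim=1280, num_labels=10)  # 1280 = ViT-H width
# optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
```

The main difference from the ViT-MAE setup above is that the I-JEPA encoder has no [CLS] token, so I just average the patch embeddings before the linear head.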