arxiv:2401.00254

Masked Image Modeling via Dynamic Token Morphing

Published on Dec 30, 2023

Authors:

Byeongho Heo

Abstract

Masked Image Modeling (MIM) arises as a promising option for Vision Transformers among various self-supervised learning (SSL) methods. The essence of MIM lies in token-wise masked patch predictions, with targets patchified from images; or generated by pre-trained tokenizers or models. We argue targets from the pre-trained models usually exhibit spatial inconsistency, which makes it excessively challenging for the model to follow to learn more discriminative representations. To mitigate the issue, we introduce a novel self-supervision signal based on Dynamic Token Morphing (DTM), which dynamically aggregates contextually related tokens. DTM can be generally applied to various SSL frameworks, yet we propose a simple MIM that employs DTM to effectively improve the performance barely introducing extra training costs. Our experiments on ImageNet-1K and ADE20K evidently demonstrate the superiority of our methods. Furthermore, the comparative evaluation of iNaturalist and Fine-grained Visual Classification datasets further validates the transferability of our method on various downstream tasks. Our code will be released publicly.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2401.00254 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2401.00254 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2401.00254 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.