Abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
Community
My summary:
When visualizing the inner workings of vision transformers (ViTs), researchers noticed weird spikes of attention on random background patches. This didn't make sense since the models should focus on foreground objects.
By analyzing the output embeddings, they found a small number of tokens (2%) had super high vector norms, causing the spikes.
The high-norm "outlier" tokens occurred in redundant areas and held less local info but more global info about the image.
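As a rough illustration of how such high-norm tokens could be flagged, here is a minimal sketch using NumPy on synthetic data. The function name `find_high_norm_tokens`, the median-plus-MAD cutoff, and the value of `k` are assumptions made for this example, not the paper's detection rule.

```python
import numpy as np

def find_high_norm_tokens(tokens, k=10.0):
    """Flag tokens whose L2 norm is far above the typical norm.

    tokens: array of shape (num_tokens, dim), e.g. one image's output patch embeddings.
    The cutoff (median + k * MAD) is an illustrative assumption.
    """
    norms = np.linalg.norm(tokens, axis=-1)
    median = np.median(norms)
    mad = np.median(np.abs(norms - median))  # median absolute deviation
    return norms > median + k * mad, norms

# Synthetic stand-in for patch embeddings: 256 tokens of dimension 64.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 64))

# Inflate about 2% of the tokens to mimic the high-norm outliers.
outlier_idx = [3, 40, 100, 200, 255]
tokens[outlier_idx] *= 10.0

mask, norms = find_high_norm_tokens(tokens)
print("flagged token indices:", np.flatnonzero(mask))
```

On real models the embeddings would come from the ViT's output patch tokens, and a fixed norm threshold chosen from the norm histogram may work just as well as a robust statistic.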
Their hypothesis is that ViTs learn to identify unimportant patches and recycle them as temporary storage instead of discarding them. This enables efficient processing but causes issues.
Their fix is simple - just add dedicated "register" tokens that provide storage space, avoiding the recycling side effects.
Models trained with registers have:
- Smoother and more meaningful attention maps
- Small boosts in downstream performance
- Way better object discovery abilities
The registers give ViTs a place to do their temporary computations without messing stuff up. Just a tiny architecture tweak improves interpretability and performance. Sweet!
I think it's cool how they reverse-engineered this model artifact and fixed it with such a small change. More work like this will keep incrementally improving ViTs.
TLDR: Vision transformers recycle useless patches to store data, causing problems. Adding dedicated register tokens for storage fixes it nicely.
I wonder if it's possible to automate the detection of such "outliers" in a systematic way by applying self-supervised training on the attention values.
This reminds me in some ways of PEFT's prefix tuning. I wonder whether a similar "wrapper" approach, possibly alongside more traditional PEFT methods, could be used to add this behavior to existing Hugging Face and PyTorch ViT models?
Proposes register tokens. The paper identifies and characterises visual artifacts in the latent maps of vision transformer models (supervised and SSL settings) caused by low-information tokens being repurposed; adding extra input tokens (register tokens) fixes the issue, gives clean latent maps, and improves performance. The goal is to understand and mitigate peak outlier values in attention maps so that object discovery methods like LOST work better with models like DINOv2. Outlier tokens are found across layers, training iterations, and model sizes. They usually appear in patches that look like their neighbors (no distinguishing features). They store less local patch information (shown by linear probing for patch position and by reconstruction) and more global information (linear-probe image classification on outlier tokens beats normal patch tokens but trails CLS). The fix is to add extra trainable tokens (appended after CLS in the input) which are discarded from the output after the forward pass; the model uses them to store global information, similar to Memory Transformers in NLP. BERT (SEP, MASK), DETR (object queries), and ViDT (detection) also use extra input tokens, and such tokens are common in multimodal information aggregation; here the extra tokens are not used after the forward pass. The method is tested on supervised learning (DeiT-III classification with ViTs on ImageNet), text-supervised learning (OpenCLIP text-image alignment with ViT-B/16), and self-supervised learning (DINOv2). It does not degrade performance and reduces outliers in attention maps; adding more registers helps only marginally (results saturate), and four is used as the optimal number. Registers sometimes attend to different parts of the global context (register tokens are visualized in Fig 9). DINOv2 with registers significantly improves object discovery using LOST. The appendix has a note on antialiasing when interpolating position embeddings (in DINOv2), reports a FLOP increase under 2% for 4 tokens, and includes an analysis of LOST performance and qualitative results (attention maps, first principal component, and normalized outputs). From Meta, INRIA.
Links: arxiv, PapersWithCode, GitHub (LOST)
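The mechanism summarized above (learnable tokens placed after CLS, then dropped from the output) can be sketched in PyTorch as a wrapper around a generic token encoder. This is a minimal sketch under assumptions: `RegisterWrapper` is a hypothetical name, the encoder is a stock `nn.TransformerEncoder` standing in for a ViT's blocks, and the input is assumed to already hold CLS plus patch tokens with position embeddings added. It is not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegisterWrapper(nn.Module):
    """Hypothetical wrapper: insert learnable register tokens after CLS,
    run the encoder, then drop the registers from the output."""

    def __init__(self, encoder: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.encoder = encoder
        self.num_registers = num_registers
        # One shared set of trainable registers, broadcast over the batch.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        nn.init.normal_(self.registers, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + num_patches, dim), CLS token at index 0.
        batch = x.shape[0]
        regs = self.registers.expand(batch, -1, -1)
        # Sequence becomes [CLS, registers, patches].
        x = torch.cat([x[:, :1], regs, x[:, 1:]], dim=1)
        x = self.encoder(x)
        # Discard register outputs; keep CLS and patch tokens only.
        return torch.cat([x[:, :1], x[:, 1 + self.num_registers:]], dim=1)

# Stand-in encoder (an assumption): two standard transformer layers.
layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=2)
model = RegisterWrapper(encoder, dim=64, num_registers=4)

tokens = torch.randn(2, 1 + 16, 64)  # batch of 2: CLS + 16 patch tokens
out = model(tokens)
print(out.shape)
```

The output has the same shape as the input, so downstream heads see only CLS and patch tokens. Note that the summaries above describe models trained with registers, so the registers in a wrapper like this would still need to be trained. As a rough sanity check on cost, assuming a 224x224 input with 14x14 patches (256 patches plus CLS, 257 tokens), four registers lengthen the sequence by about 1.6%, which fits the reported FLOP increase of under 2%.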
are there any models using it out there?
code?
edit:
i can see e.g. https://github.com/facebookresearch/dinov2 added training with registers