arxiv:2410.23168

TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Published on Oct 30, 2024
· Submitted by Haiyang-W on Oct 31, 2024
#2 Paper of the day

Abstract

Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
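To make the token-parameter attention idea concrete, here is a minimal sketch in plain Python rather than the authors' implementation. One input token acts as the query, and a set of learnable parameter tokens supplies the keys and values. The abstract does not say how scores are normalized or how new pairs are initialized, so both choices here are assumptions: scores pass through an element-wise GeLU instead of a softmax, new keys start at zero, and new values start at small random numbers. The names `token_param_attention` and `grow` are hypothetical.

```python
import math
import random


def gelu(x):
    # Exact GeLU via the Gaussian error function; gelu(0.0) == 0.0.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def token_param_attention(x, keys, values):
    """Stand-in for a linear projection d_in -> d_out.

    x      : one input token (list of d_in floats), used as the query
    keys   : n parameter tokens, each a list of d_in floats
    values : n parameter tokens, each a list of d_out floats
    """
    d_out = len(values[0])
    out = [0.0] * d_out
    for k, v in zip(keys, values):
        score = sum(xi * ki for xi, ki in zip(x, k))
        weight = gelu(score)  # assumption: element-wise activation, not softmax
        for j in range(d_out):
            out[j] += weight * v[j]
    return out


def grow(keys, values, n_new, rng):
    """Append n_new key-value parameter pairs without touching existing ones.

    Assumption: new keys are zero, so each new pair has weight gelu(0) = 0
    and the layer output is unchanged at the moment of growth. New values
    are small random numbers so the new keys still receive a gradient.
    """
    d_in, d_out = len(keys[0]), len(values[0])
    new_keys = [[0.0] * d_in for _ in range(n_new)]
    new_values = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(n_new)]
    return keys + new_keys, values + new_values


rng = random.Random(0)
d_in, d_out, n = 4, 3, 8
keys = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(n)]
values = [[rng.gauss(0.0, 1.0) for _ in range(d_out)] for _ in range(n)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]

y_small = token_param_attention(x, keys, values)
big_keys, big_values = grow(keys, values, 4, rng)
y_big = token_param_attention(x, big_keys, big_values)
```

In this sketch the input and output widths stay fixed while the number of parameter tokens grows from 8 to 12, and `y_big` matches `y_small` immediately after growth. With a standard softmax, zero-initialized keys would still take a share of the normalization and shift the output, which is why the sketch uses an element-wise activation.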

Community


A fully attention-based architecture that unifies token-token and token-parameter interactions under the attention mechanism alone, maximizing the flexibility of the neural network.
We tokenize not only data but also model parameters, replacing the notion of a model with interaction flows between data tokens and parameter tokens, further advancing network architecture towards unification.
TokenFormer.png


A video summary is available here - https://aipapersacademy.com/tokenformer/

Hi, this was a wonderful read. We summarised this paper and a few others in our biweekly blog.

  1. nGPT: Normalized Transformer with Representation Learning on the Hypersphere
  2. LAUREL: Learned Augmented Residual Layer
  3. TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Please give it a read and share your thoughts/feedback.

Thank you for this paper, @Haiyang-W. I wanted to clarify why the model dim changes between the TokenFormer checkpoints you released. As per the paper, the model dim is kept fixed and the K and V matrices are progressively extended; however, the released checkpoints (e.g., https://huggingface.co/Haiyang-W/TokenFormer-1-5B and https://huggingface.co/Haiyang-W/TokenFormer-150M) have different model dims. Also, the model READMEs mention that each checkpoint was trained on 300B tokens, whereas the paper mentions that the initial 124M model was trained on 300B tokens, followed by 15/30/60B tokens for each progressive scale-up. Could you clarify these two discrepancies?

