arxiv:2412.17739

Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization

Published on Dec 23, 2024
· Submitted by iseesaw on Dec 25, 2024
#1 Paper of the day
Abstract

Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against the spectral damage. Experiments across various model scales show that, within varying context windows, FoPE maintains a more stable perplexity and a more consistent accuracy in a needle-in-haystack task than RoPE and ALiBi. Several analyses and ablations bring further support to our method and theoretical modeling.
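
To make the "zeroes out the destructive frequency components" step concrete, here is a minimal NumPy sketch. It is not the authors' implementation (that is in the paper's appendix). It assumes the standard RoPE frequency schedule with base 10000, and it assumes that an "insufficiently trained" component is one whose period is longer than the training length. The function names `rope_frequencies` and `zero_undertrained` are hypothetical, and the learned Fourier Series part of FoPE is not covered.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angular frequencies, one per pair of dimensions."""
    pair_idx = np.arange(head_dim // 2)
    return base ** (-2.0 * pair_idx / head_dim)

def zero_undertrained(freqs: np.ndarray, train_len: int) -> np.ndarray:
    """Set to zero every frequency whose period exceeds the training length.

    Assumption: a component counts as undertrained when its period
    2*pi/omega is longer than train_len, so training never sees a full cycle.
    """
    floor = 2.0 * np.pi / train_len
    return np.where(freqs < floor, 0.0, freqs)

freqs = rope_frequencies(head_dim=64)
clipped = zero_undertrained(freqs, train_len=512)
n_zeroed = int(np.sum(clipped == 0.0))
print(f"{n_zeroed} of {freqs.size} frequencies zeroed for train_len=512")
```

A zeroed frequency means the corresponding pair of dimensions is no longer rotated by position at all, so it cannot drift into unseen phases at longer contexts.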

Community

Paper author

It is interesting that not only Attention influences length generalization, but Linear Layers and Activation Functions also play a role.

Seems like a definitive way to go!

Thanks!

It would be interesting to see whether this benefit tends to be additive or multiplicative with respect to context length.

Suppose, for example, the model is trained with a max sequence length of 512, and FoPE keeps performance at 80% (relative to context length 512) at context length 2048. Then FoPE could have moved the 80% performance mark either to max_seq_length * 4 or to max_seq_length + 1536.

How does the 80% performance mark move when scaling up the max sequence length during training? If we train on an 8k sequence length, will the 80% performance mark be around 32k or closer to 10k context length?
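
The two hypotheses can be spelled out as arithmetic. This is only a sketch of the question, using the hypothetical numbers from the example above (512 training length, 80% mark at 2048), not measured results. `predict_mark` is an illustrative helper name.

```python
def predict_mark(train_len: int, ref_train: int = 512, ref_mark: int = 2048) -> dict:
    """Extrapolate the 80% mark from one reference point under both hypotheses."""
    factor = ref_mark / ref_train   # multiplicative hypothesis: mark = train_len * factor
    offset = ref_mark - ref_train   # additive hypothesis: mark = train_len + offset
    return {
        "multiplicative": train_len * factor,
        "additive": train_len + offset,
    }

pred = predict_mark(8192)
print(pred)
```

With an 8k training length the two hypotheses give 32768 versus 9728 tokens, which is the "around 32k or closer to 10k" gap in the question.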

The benefit probably tends to be multiplicative, which can be inferred from Figure 3 in the original paper. When the training length is extrapolated to 1k, performance stays at 80% up to a 16k context length.

[Image: fig3.png (Figure 3 of the paper)]

@iseesaw @Messi-Hua
Would it be possible for you to open-source your training and data-processing code, and to share exactly which data you used for training and how many steps/epochs you trained for?
I've been looking for a good baseline setup that I can use when experimenting with new model architectures. If you were kind enough to make your setup available, I could directly compare my experimental results against yours as a baseline.

Paper author

Sure, we will open-source our code as soon as possible, likely within a week, as the main contributors of FoPE are currently in their exam period.

Our code is based entirely on OLMo (https://github.com/allenai/OLMo), which already provides the training and data-preprocessing code. The core code of FoPE and the main setup are available in the appendix of our paper. If you'd like to reproduce the results immediately, you can also integrate FoPE directly into the OLMo repository.

I want to check my understanding of the ideas presented in this paper (I don't have a DSP background):

  1. RoPE can (ideally) be seen as encoding positional information as a phase shift/spin at some multiple of a predetermined frequency associated with each dimension.
  2. Unfortunately, operations both within attention and outside of it cause spectral leakage of this positional information (e.g. additive superposition of multiple dimensions mixed into the same dimension), as well as distortion due to nonlinearity.
  3. The hypothesis is that gradient-based training is ill-suited to learning the true positional invariances/symmetries under this spectral damage. As such, it likely overfits/memorizes how to extract positional information from data seen during training, without learning to generalize.
  4. Additionally (and I believe this is also a big part of YaRN, as well as Peng et al.), low-frequency dimensions in RoPE fail to learn even the basic periodicity of the PE because the training data does not cover even a full period at these dimensions.
  5. The way to "fix" this spectral damage is to feature-engineer the encoding to include other subservient harmonic frequencies (and learn their amplitudes) to account for spectral leakage/distortion, and to clip/remove the PE for low-frequency dimensions altogether. This matters especially because there's a wealth of evidence (here and in many other papers) that these dimensions tend to carry important information. (Some previous work, IIRC, even conjectures that RoPE creates an inductive bias to put more emphasis on these dimensions precisely because their PE basically amounts to a near-constant, but going way beyond the learned lengths causes catastrophic OOD.)
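
Point 1 above can be checked numerically. The sketch below is plain RoPE (not FoPE), assuming the usual base-10000 frequency schedule and a toy head dimension of 8. It rotates each pair of dimensions by `pos * freq` and shows that the query-key score depends only on the relative offset between positions. `rope_rotate` is an illustrative helper name.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, freqs: np.ndarray) -> np.ndarray:
    """Rotate consecutive (even, odd) pairs of x by the angle pos * freq."""
    pairs = x.reshape(-1, 2)
    z = pairs[:, 0] + 1j * pairs[:, 1]    # treat each pair as a complex number
    z = z * np.exp(1j * pos * freqs)      # phase shift proportional to position
    return np.stack([z.real, z.imag], axis=-1).reshape(-1)

rng = np.random.default_rng(0)
head_dim = 8
freqs = 10000.0 ** (-2.0 * np.arange(head_dim // 2) / head_dim)
q = rng.normal(size=head_dim)
k = rng.normal(size=head_dim)

# Same relative offset (7) at two very different absolute positions.
score_a = rope_rotate(q, 10, freqs) @ rope_rotate(k, 3, freqs)
score_b = rope_rotate(q, 110, freqs) @ rope_rotate(k, 103, freqs)
print(np.isclose(score_a, score_b))

# Point 4: period (in tokens) of the lowest-frequency pair in this toy setup.
longest_period = 2.0 * np.pi / freqs[-1]
print(longest_period)
```

In this toy setup the slowest pair needs several thousand tokens to complete one cycle, so a short training window only ever sees a fraction of its period.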

However, I didn't see a graph or ablation for the contributions of Additional-Harmonics vs Zeroing-Low-Frequency on the passkey retrieval tasks.

Edit: I forgot to mention, I really like the idea and the approach here. I am surprised that zeroing low frequencies had less impact on length generalization; I would've bet on the opposite.

And on the information theory side, does plain RoPE or FoPE learn/have an inductive bias for some form of ECCs? That'd be fascinating.

I will address your points one by one:

  1. For your points (1), (2) and (3), I agree with you.

    • Theoretically, RoPE can achieve periodic encoding in each dimension, using the phase of the frequency component allocated to that dimension.
    • However, LMs fail to fit this periodic pattern because each dimension contains many other frequencies. Without periodicity, the expected length generalization of RoPE is undermined.
  2. For your points (4) and (5), I have a somewhat different view.

    • As for why low-frequency dimensions are important, it is partially because low frequencies tend to carry more information (based on the basic hypothesis that high frequencies are difficult to perceive in the physical world).
    • As for why Phase-OOD in low-frequency dimensions leads to Position-OOD, we have some ablations in Sec 5.5. It is partially because the average position embedding of these dimensions is not zero, which introduces a positional bias that hinders positional generalization. This is a time-domain explanation; a frequency-domain explanation can be found in Sec 3.3.
    • As for why zeroing out low frequencies is useful for generalization, it is partially because a zero frequency brings zero average position embedding and thus no positional bias. Additionally, the zero frequency mathematically has both the shortest and the longest period/wavelength, so it can represent long-term and short-term dependencies at the same time.
  3. For the ablation on passkey retrieval, I agree that it is important. But limited by our computation resources, we only ran the ablations on models trained on Gutenberg Books. The passkey retrieval performance on this dataset is too weak for all methods, and we will supplement the ablation results on C4 once we have enough GPUs.

  4. For the ECC (Error Correction Code?) view, I think it may be useful for estimating a bound on how many positions each model can represent with a specific position embedding, but it may not answer how much a position embedding helps length generalization.
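
The time-domain remark in point 2, that the average position embedding of low-frequency dimensions is not zero, can be illustrated with a small sketch. This is not taken from the paper's Sec 5.5. It assumes the position embedding of one dimension pair is the complex exponential exp(i * omega * n) and averages it over a training window of 512 positions. `window_mean` is an illustrative helper name.

```python
import numpy as np

def window_mean(omega: float, train_len: int) -> float:
    """Magnitude of the average of exp(i * omega * n) over n = 0 .. train_len - 1."""
    n = np.arange(train_len)
    return float(np.abs(np.mean(np.exp(1j * omega * n))))

train_len = 512
# A component that completes exactly 8 periods inside the training window.
full = window_mean(2.0 * np.pi * 8 / train_len, train_len)
# A low-frequency component that covers only a quarter of its period.
partial = window_mean(2.0 * np.pi * 0.25 / train_len, train_len)
print(full, partial)
```

The component that completes whole periods averages out to (numerically) zero over the window, while the quarter-period component keeps a large non-zero mean, which is one way to read the "positional bias" described above.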
