Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
Abstract
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering their adverse effects on length generalization for RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales show that, within varying context windows, FoPE maintains a more stable perplexity and a more consistent accuracy in a needle-in-haystack task compared to RoPE and ALiBi. Several analyses and ablations bring further support to our method and theoretical modeling.
Community
It is interesting that not only Attention influences length generalization, but Linear Layers and Activation Functions also play a role.
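To make the point about linear layers and activation functions concrete, here is a small Python sketch that is not taken from the paper. It assumes a toy setting: one hidden dimension carrying a single pure cosine over 64 positions, a stand-in "linear layer" that mixes in a second dimension with hand-picked weights 0.7 and 0.3, and ReLU as the activation. The helper names `cosine` and `amplitude` are introduced here for illustration only.

```python
import math

N = 64  # number of positions in the toy sequence (an assumption)

def cosine(k, n_points=N):
    """A pure cosine that completes exactly k periods over n_points positions."""
    return [math.cos(2 * math.pi * k * n / n_points) for n in range(n_points)]

def amplitude(signal, k):
    """Amplitude of frequency bin k (0 < k < N/2), computed with a plain DFT."""
    n_points = len(signal)
    re = sum(x * math.cos(2 * math.pi * k * n / n_points) for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * k * n / n_points) for n, x in enumerate(signal))
    return 2 * math.hypot(re, im) / n_points

clean = cosine(4)                                   # one frequency per dimension
other = cosine(10)                                  # a neighbouring dimension
mixed = [0.7 * a + 0.3 * b for a, b in zip(clean, other)]  # toy linear mixing
rectified = [max(0.0, x) for x in clean]            # ReLU applied to the clean signal

for name, sig in [("clean", clean), ("mixed", mixed), ("relu", rectified)]:
    print(name, [round(amplitude(sig, k), 3) for k in (4, 8, 10)])
```

In this toy setup the clean signal has energy only at bin 4, the mixed signal also picks up bin 10 from the other dimension, and the rectified signal gains a harmonic at bin 8 that was not present before the nonlinearity.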
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reversed Attention: On The Gradient Descent Of Attention Layers In GPT (2024)
- Length-Induced Embedding Collapse in Transformer-based Models (2024)
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training (2024)
- GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models (2024)
- Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models (2024)
- Retrofitting Large Language Models with Dynamic Tokenization (2024)
- CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up (2024)
Thanks!
It would be interesting to see whether this benefit tends to be additive or multiplicative in context length.
Suppose, for example, that the model is trained with a max sequence length of 512, and that with FoPE its performance at context length 2048 is still at 80% (compared to context length 512). Then FoPE could have moved the 80% performance mark either to max_seq_length * 4 or to max_seq_length + 1536.
How does the 80% performance mark move when scaling up the max sequence length during training? If we train on an 8k sequence length, will the 80% performance mark be around 32k, or closer to 10k context length?
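The two hypotheses in this question can be written out as a few lines of arithmetic. The sketch below only restates the numbers from the comment (512, 2048, and 8k taken as 8192); the function name `extrapolation_hypotheses` is made up for illustration, and neither prediction is a measured result.

```python
def extrapolation_hypotheses(train_len, mark_len, new_train_len):
    """Predict where the 80% mark would land for a new training length.

    train_len:     training length of the reference run (e.g. 512)
    mark_len:      context length where the reference run is still at 80% (e.g. 2048)
    new_train_len: training length of the hypothetical new run (e.g. 8192)
    """
    factor = mark_len / train_len   # multiplicative hypothesis: mark = factor * train_len
    offset = mark_len - train_len   # additive hypothesis: mark = train_len + offset
    return {
        "multiplicative": new_train_len * factor,
        "additive": new_train_len + offset,
    }

print(extrapolation_hypotheses(512, 2048, 8192))
# multiplicative: 32768.0, additive: 9728
```

The two predictions differ by more than a factor of three at an 8k training length, so a single extra training run at a longer length would be enough to tell them apart.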
@iseesaw
@Messi-Hua
Would it be possible for you to open-source your training and data-processing code, and to share exactly which data you used for training and how many steps/epochs you trained for?
I've been looking for a good baseline setup to fall back on when experimenting with new model architectures. If you were so kind as to make your setup available, I could directly compare my experimental results against yours as a baseline.
Sure, we will open source our code as soon as possible, likely within a week, as the main contributors of FoPE are currently in their exam period.
Our code is based entirely on OLMo (https://github.com/allenai/OLMo), which already provides the training and data-preprocessing code. The core code of FoPE and the main setup are available in the appendix of our paper. If you'd like to reproduce the results immediately, you can also integrate FoPE directly into the OLMo repository.
I want to check my understanding of the ideas presented in this paper (I don't have a DSP background)
- RoPE can be (ideally) seen as attributing positional information encoded as a phase shift/spin of some multiple of predetermined frequencies associated with each dimension.
- Unfortunately, the operations both within attention and outside of it cause spectral leakage of this positional information (e.g. additive superposition of multiple dimensions mixed into the same dimension), as well as distortion due to nonlinearity.
- The hypothesis is that gradient-based training is ill-suited to learning the true positional invariances/symmetries under this spectral damage. As such, it likely learns to overfit/memorize how to extract positional information from the data seen during training, without learning to generalize.
- Additionally (and I believe this is also a big part of YaRN as well as Peng et al.), low-frequency dimensions in RoPE fail to learn even the basic periodicity of the PE, because the training data does not even cover a full period at these dimensions.
- The way to "fix" this spectral damage is to feature-engineer the encoding: include other, subordinate harmonic frequencies (and learn their amplitudes) to handle spectral leakage/distortion, and clip/remove the PE for low-frequency dimensions altogether. The latter matters because there's a wealth of evidence (here and in many other papers) that these dimensions tend to carry important information. Some previous work, IIRC, even conjectures that RoPE creates an inductive bias to put more emphasis on these dimensions precisely because their PE basically amounts to a near-constant, but going way beyond the learned lengths then causes catastrophic OOD behavior.
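The point about low-frequency dimensions not covering a full period can be checked numerically. The sketch below uses the standard RoPE frequency schedule with a base of 10000, and assumes a head dimension of 64 and a training length of 512 purely for illustration. Treating 2π / training length as the cut-off is my reading of "does not even cover a full period", not something confirmed against the authors' code, and `clip_below_floor` is a hypothetical helper name.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per pair of dimensions."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def clip_below_floor(freqs, train_len):
    """Set to zero every frequency whose period is longer than train_len.

    Assumption: the floor is 2*pi / train_len, i.e. exactly one full period
    fits into the training window.
    """
    floor = 2 * math.pi / train_len
    return [w if w >= floor else 0.0 for w in freqs]

head_dim, train_len = 64, 512          # illustrative values, not from the paper
freqs = rope_frequencies(head_dim)
clipped = clip_below_floor(freqs, train_len)
undertrained = [i for i, w in enumerate(clipped) if w == 0.0]

print(f"{len(undertrained)} of {len(freqs)} frequencies never complete "
      f"a period within {train_len} tokens")
for i in undertrained[:3]:
    print(f"pair {i}: period of about {2 * math.pi / freqs[i]:.0f} tokens")
```

Under these assumptions, half of the frequency pairs have periods longer than the training window, which gives a sense of how much of the embedding is affected.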
However, I didn't see a graph/ablation for the contributions of Additional-Harmonics vs Zeroing-Low-Frequency on the passkey retrieval tasks.
Edit: I forgot to mention, I really like the idea and the approach here. I am surprised that zeroing low frequencies had less impact on length generalization; I would've bet on the opposite.
And on the information theory side, does plain RoPE or FoPE learn/have an inductive bias for some form of ECCs? That'd be fascinating.
I will address your points one by one:
On points (1), (2), and (3), I agree with you.
- Theoretically, RoPE can achieve periodic encoding in each dimension, using the phase of the frequency components allocated to that dimension.
- However, LMs fail to fit this periodic pattern because there are many other frequencies in each dimension. Without periodicity, the expected length generalization of RoPE is undermined.
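The ideal single-frequency case described in these two bullets can be sketched with complex numbers, treating one pair of dimensions as one complex value rotated by its position. This is the textbook RoPE formulation rather than code from the paper; the query/key values and the period of 32 are arbitrary choices, and `rotate` and `score` are illustrative names.

```python
import cmath
import math

def rotate(x, omega, pos):
    """Apply the RoPE rotation for one 2-D pair, written as a complex number."""
    return x * cmath.exp(1j * omega * pos)

def score(q, k, omega, m, n):
    """Contribution of one 2-D pair to the attention dot product q_m . k_n."""
    return (rotate(q, omega, m) * rotate(k, omega, n).conjugate()).real

q, k = complex(0.3, -1.2), complex(0.8, 0.5)   # arbitrary query/key pair
period = 32                                     # arbitrary period in tokens
omega = 2 * math.pi / period

a = score(q, k, omega, m=10, n=3)               # offset 7
b = score(q, k, omega, m=110, n=103)            # same offset, shifted positions
c = score(q, k, omega, m=10 + period, n=3)      # offset 7 plus one full period

print(abs(a - b) < 1e-9, abs(a - c) < 1e-9)
```

With a single clean frequency, the score depends only on the offset m - n and repeats every period. Once several frequencies share one dimension, as the second bullet describes, the repetition no longer holds exactly.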
On points (4) and (5), I have a somewhat different view.
- As for why low-frequency dimensions are important, it is partly because low frequencies tend to carry more information (based on the basic hypothesis that high-frequency content is difficult to perceive in the physical world).
- As for why Phase-OOD in low-frequency dimensions leads to Position-OOD, we have some ablations in Sec 5.5. It is partly because the average position embedding of these dimensions is not zero, which introduces a positional bias that hinders positional generalization. This is a time-domain explanation; a frequency-domain explanation can be found in Sec 3.3.
- As for why zeroing out low frequencies is useful for generalization, it is partly because a zero frequency gives a zero average position embedding and so introduces no positional bias. Additionally, the zero frequency mathematically has both the shortest and the longest period/wavelength, so it can represent long-term and short-term dependencies at the same time.
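The time-domain argument about a non-zero average can be illustrated with a short calculation. The sketch below looks only at the cosine component of one frequency, averaged over an assumed training window of 512 positions. The two periods (64 and 4096 tokens) are arbitrary examples, and `mean_cos` is a name introduced here.

```python
import math

def mean_cos(omega, window):
    """Average of cos(omega * n) over positions 0 .. window - 1."""
    return sum(math.cos(omega * n) for n in range(window)) / window

window = 512                      # assumed training context length
high = 2 * math.pi / 64           # completes 8 full periods inside the window
low = 2 * math.pi / 4096          # completes only one eighth of a period

print(f"period 64:   mean = {mean_cos(high, window):+.4f}")
print(f"period 4096: mean = {mean_cos(low, window):+.4f}")
```

A frequency that completes whole periods inside the window averages out to roughly zero. One whose period is much longer than the window stays close to its starting value throughout, so its average is far from zero over the positions seen in training.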
As for the ablation on Passkey retrieval, I agree that it is important. But given our limited compute resources, we only ran the ablations on models trained on Gutenberg Books. Passkey retrieval performance on this dataset is too weak for all methods, so we will add the ablation results on C4 once we have enough GPUs.
As for the ECC view (Error Correction Code?), I think it may be useful for estimating a bound on how many positions a model can represent with a specific position embedding, but it may not answer how much a position embedding helps length generalization.