From what I have seen, most token classification models out there have maximum token lengths below 1k. Are there any models that can be used (or customized) for very long texts (long-form documents)?
Assuming a model's maximum token length is customizable, I assume its memory footprint has to be light enough to fit a large number of embeddings and weights in GPU memory at once?
Any help or recommendations for tackling this problem would be greatly appreciated.
Most models have a 512-token limit and cannot extrapolate to longer sequences.
Memory footprint also increases quadratically with sequence length because standard attention is O(n²).
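As a rough back-of-the-envelope illustration (the 12 heads and float32 scores below are assumptions, not tied to any particular model), the attention score matrix alone makes this growth visible:

```python
# Size of the attention score matrix per layer: one float per token pair, per head.
def attn_matrix_mb(n_tokens, n_heads=12, bytes_per_float=4):
    return n_tokens ** 2 * n_heads * bytes_per_float / 1e6

print(attn_matrix_mb(512))     # ~12.6 MB per layer
print(attn_matrix_mb(16_384))  # ~12,885 MB (~12.9 GB) per layer, 1024x more
```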
The best way to handle long sequences is to use a custom attention mechanism.
You can try this repo with a small model and a small block size; you should be able to process 16k-token sequences.
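The repo isn't reproduced here, but if you end up sticking with a standard 512-token model, a common fallback is sliding-window chunking with HuggingFace's fast tokenizers. The checkpoint name and window/stride sizes below are arbitrary choices, just a sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
long_text = " ".join(["word"] * 5000)  # stand-in for a long-form document

# Split the document into overlapping 512-token windows; each window is
# classified independently and the overlapping predictions merged afterwards.
encoded = tokenizer(
    long_text,
    max_length=512,
    stride=128,                      # overlap between consecutive windows
    truncation=True,
    return_overflowing_tokens=True,  # return every window, not just the first
    return_offsets_mapping=True,     # map token predictions back to char spans
)
print(len(encoded["input_ids"]), "windows of", len(encoded["input_ids"][0]), "tokens")
```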
The BART model goes up to 1024 tokens.
Then there are models that can take up to 16k tokens, but they're more custom and not always available out of the box on HuggingFace. One of these is the Longformer, for example; their model can be accessed via HuggingFace as shown here. You may also want to take a look at this recent paper from Google. It is a model specifically for text generation (not exactly classification as you asked, but it gives you an idea of what's possible), and they have also made their code available (you can see more details here and here; there is still an open PR that will be merged into the main HuggingFace branch soon, so right now you'd have to take their code from the fork).
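If you want to try the Longformer route, loading it for token classification is straightforward with the auto classes. A minimal sketch, assuming the commonly used `allenai/longformer-base-4096` checkpoint (which accepts 4096 tokens; larger variants go further) and a placeholder `num_labels=5` for your label set:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(name)
# The token-classification head is freshly initialised; fine-tune it on your labelled data.
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=5)

inputs = tokenizer("a very long document ...", truncation=True,
                   max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, num_labels)
print(logits.shape)
```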
The best way to handle long sequences is to use a custom attention mechanism.
Is there a specific reason you didn't recommend using earlier RNN-based models? Since they don't have an attention mechanism, their memory footprint should theoretically be linear in the sequence length, right?
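For what it's worth, here is a minimal sketch of the kind of RNN-based tagger I mean (layer sizes are arbitrary): with no n×n attention matrix, activation memory grows linearly with sequence length, so a 16k-token input fits comfortably.

```python
import torch
import torch.nn as nn

# Minimal BiLSTM token classifier: no attention matrix, so activation
# memory scales linearly with sequence length.
class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, input_ids):
        x = self.embed(input_ids)      # (batch, seq_len, emb_dim)
        x, _ = self.lstm(x)            # (batch, seq_len, 2 * hidden)
        return self.classifier(x)      # (batch, seq_len, num_labels)

model = LSTMTagger(vocab_size=30_000, num_labels=5)
logits = model(torch.randint(0, 30_000, (1, 16_384)))  # a 16k-token sequence
print(logits.shape)  # torch.Size([1, 16384, 5])
```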