Do the distilled models also have 128K context?

#4 by Troyanovsky

DeepSeek-R1 has a 128K context length. Do the distilled models also have this context length, or a smaller one?

This also depends on the base models used for distillation, but as long as those support it (which is the case with Llama), it should be fine.
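
As a quick sanity check, you can read the context window declared in each model's config. A minimal sketch, assuming the `deepseek-ai` distilled repo IDs below and that the `transformers` library is installed:

```python
from transformers import AutoConfig

# Assumed repo IDs for two of the distilled checkpoints
repos = [
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
]

for repo in repos:
    cfg = AutoConfig.from_pretrained(repo)
    # max_position_embeddings is the usual field for the declared context window
    print(repo, getattr(cfg, "max_position_embeddings", "n/a"))
```

Note that the value reported here is whatever the base model's config declares; the effective usable context can still differ depending on how the model is served.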
