Do the distilled models also have 128K context?

#4 by Troyanovsky

DeepSeek-R1 has a 128K context length. Do the distilled models also have this context length, or a smaller one?

This also depends on the base models used for distillation, but as long as those support it (which is the case with Llama), it should be fine.
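
As a quick sanity check, you can read the context window declared in each model's config. A minimal sketch, assuming the `deepseek-ai` distilled repo IDs below and that the `transformers` library is installed:

```python
from transformers import AutoConfig

# Assumed repo IDs for two of the distilled checkpoints
repos = [
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
]

for repo in repos:
    cfg = AutoConfig.from_pretrained(repo)
    # max_position_embeddings is the usual field for the declared context window
    print(repo, getattr(cfg, "max_position_embeddings", "n/a"))
```

Note that the value reported here is whatever the base model's config declares; the effective usable context can still differ depending on how the model is served.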
