Here we introduce TinyLLaVA-Video-Qwen2.5-3B-16-512. The model uses Qwen2.5-3B as the LLM and siglip-so400m-patch14-384 as the vision tower. It samples 16 frames from each video and represents the video sequence with 512 tokens.
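A minimal sketch of the 16-frame uniform sampling described above, using OpenCV. This is only an illustration of the sampling strategy; the model's own preprocessing pipeline (resizing, normalization, tokenization into 512 video tokens) lives in the repository code and may differ in details.

```python
# Illustrative only: uniformly sample N frames from a video with OpenCV.
# The function name and use of OpenCV are assumptions for this sketch,
# not the model's actual preprocessing code.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick num_frames indices spread uniformly across the video.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)
```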
### Results
| Model (HF Path) | #Frames/#Tokens | Video-MME | MVBench | LongVideoBench | MLVU |
|---|---|---|---|---|---|
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512 | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| Zhang199/TinyLLaVA-Video-Phi2-16-512 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |