TinyLLaVA-Video


Here, we introduce TinyLLaVA-Video-Qwen2.5-3B-16-512. We use Qwen2.5-3B as the LLM and siglip-so400m-patch14-384 as the vision tower. The model samples 16 frames from each video and represents the whole video sequence with 512 tokens.
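The card does not say how the 16 frames are chosen. As a minimal sketch, assuming frames are spread evenly over the video (one common choice, not confirmed here), the indices could be picked as below. The function name `sample_frame_indices` is hypothetical and is not part of the TinyLLaVA-Video codebase.

```python
def sample_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Return num_samples frame indices spread evenly across a video.

    Assumption: uniform sampling. The model card only states that
    16 frames are sampled, not which strategy is used.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")

    # Split the video into num_samples equal segments and take the
    # midpoint of each one, clamped to the last valid frame index.
    step = total_frames / num_samples
    return [
        min(total_frames - 1, int((i + 0.5) * step))
        for i in range(num_samples)
    ]


if __name__ == "__main__":
    # A 160-frame clip yields one index from each 10-frame segment.
    print(sample_frame_indices(160))
```

Videos shorter than 16 frames produce repeated indices under this scheme, so the frame count stays fixed at 16. The 512-token budget for the video is independent of this step.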

Results

| Model (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
| --- | --- | --- | --- | --- | --- |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512 | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| Zhang199/TinyLLaVA-Video-Phi2-16-512 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
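
To read the results per benchmark, the short sketch below computes the score gap between the two checkpoints and which one leads on each benchmark. The numbers are copied from the results above. The helper name `compare_models` is illustrative only.

```python
BENCHMARKS = ["Video-MME", "MVBench", "LongVideoBench", "MLVU"]

# Scores copied from the results reported in this model card.
SCORES = {
    "Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512": [44.7, 42.5, 37.6, 48.1],
    "Zhang199/TinyLLaVA-Video-Phi2-16-512": [42.7, 42.0, 42.2, 46.5],
}


def compare_models(scores: dict[str, list[float]], model_a: str, model_b: str) -> dict[str, tuple[float, str]]:
    """Map each benchmark to (score_a - score_b, name of the leading model)."""
    result = {}
    for name, a, b in zip(BENCHMARKS, scores[model_a], scores[model_b]):
        # Round to one decimal to match the precision of the reported scores.
        delta = round(a - b, 1)
        result[name] = (delta, model_a if a >= b else model_b)
    return result


if __name__ == "__main__":
    qwen, phi2 = SCORES.keys()
    for bench, (delta, leader) in compare_models(SCORES, qwen, phi2).items():
        print(f"{bench}: {delta:+.1f} (leader: {leader})")
```

On these numbers the Qwen2.5-3B variant leads on Video-MME, MVBench, and MLVU, while the Phi2 variant leads on LongVideoBench.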

Model size: 3.63B params (Safetensors, tensor type BF16)