Here we introduce TinyLLaVA-Video-Qwen2.5-3B-16-512. The model uses Qwen2.5-3B as the LLM and siglip-so400m-patch14-384 as the vision tower. It samples 16 frames from each video and represents the video sequence with 512 tokens.
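A minimal sketch of the 16-frame uniform sampling described above, using OpenCV. This is only an illustration of the sampling strategy; the model's own preprocessing pipeline (resizing, normalization, tokenization into 512 video tokens) lives in the repository code and may differ in details.

```python
# Illustrative only: uniformly sample N frames from a video with OpenCV.
# The function name and use of OpenCV are assumptions for this sketch,
# not the model's actual preprocessing code.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Pick num_frames indices spread uniformly across the video.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)  # shape: (num_frames, H, W, 3)
```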
### Results
| Model (HF Path) | #Frames/#Tokens | Video-MME | MVBench | LongVideoBench | MLVU |
|---|---|---|---|---|---|
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-16-512 | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
| Zhang199/TinyLLaVA-Video-Phi2-16-512 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |