Algea-VE: A Tiny Multimodal Language Model with Only 0.8B Parameters

Algea-VE is trained on the LAION-CC-SBU dataset with algea-550M-base as the base model, then fine-tuned on llava_v1_5_mix665k. It uses CLIP ViT-L/14-336 as the visual encoder. The model is very small: fine-tuning requires only 32 GB of VRAM and inference about 3 GB.

Due to insufficient training of the base model, the current model has some issues with hallucinations and repetition. To address this, I am training a new model that will maintain the same size but offer better performance.

This model is built on top of the llavaphi project; see that project's usage instructions to run it.
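The sketch below shows one plausible way to run inference with Hugging Face transformers, assuming the checkpoint follows the standard LLaVA layout supported by `LlavaForConditionalGeneration`. The checkpoint path, the example image file, and the LLaVA-1.5-style prompt format are placeholders/assumptions; if the weights are not in that layout, load the model through the llavaphi project's own code instead.

```python
# Minimal inference sketch (assumptions: hypothetical checkpoint path,
# standard LLaVA checkpoint layout, LLaVA-1.5-style prompt format).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "path/to/algea-ve"  # hypothetical local path or Hub repo id

# Load the model in BF16 (the weights are stored in BF16).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # any test image
prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

# Move inputs to the model device and cast image features to BF16.
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = inputs.to(model.device, torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```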

Model size: 862M parameters · Tensor type: BF16 (Safetensors)