Demo Crafters
demo-crafters's activity
Post
Everything that happened this week in open-source AI, a recap
merve/jan-17-releases-678a673a9de4a4675f215bf5
Multimodal
- MiniCPM-o 2.6 is a new sota any-to-any model by OpenBMB (vision, speech and text!)
- VideoChat-Flash-Qwen2.5-2B is one of the new video multimodal models by OpenGVLab that come in 2B & 7B sizes and 224 & 448 resolutions
- ByteDance released a larger SA2VA that comes in at 26B parameters
- Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
LLMs
- MiniMax-Text-01 is a new huge language model (456B total, 45.9B active params) by MiniMaxAI with a context length of 4M tokens
- Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B
- kyutai released Helium-1-Preview-2B, a new small multilingual LM
- Wayfarer-12B is a new LLM able to write D&D adventures
- ReaderLM-v2 is a new HTML parsing model by Jina AI
- Dria released Dria-Agent-a-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder
- Unsloth released faster, more memory-efficient versions of Phi-4 and Llama 3.3
Vision
- MatchAnything is a new foundation model for image matching
- FitDiT is a high-fidelity virtual try-on (VTON) model based on the DiT architecture
Audio
- OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
Retrieval
- lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages
- cde-small-v2 is a new sota small retrieval model by @jxm
Post
New smolagents example landed on the Hugging Face cookbook
Learn how to create an inventory-managing multi-agent system with smolagents, MongoDB and DeepSeek Chat: https://huggingface.co/learn/cookbook/mongodb_smolagents_multi_micro_agents
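The cookbook notebook wires up a full multi-agent system; as a taste of the basic wiring, here is a minimal single-agent sketch that exposes a MongoDB query as a smolagents tool. The connection string, database/collection names and the query are placeholders, and HfApiModel stands in for the DeepSeek Chat model used in the notebook.

```python
from pymongo import MongoClient
from smolagents import CodeAgent, HfApiModel, tool

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
inventory = client["shop"]["inventory"]            # hypothetical database/collection

@tool
def count_low_stock_items(threshold: int) -> int:
    """Count inventory items whose stock level is below a threshold.

    Args:
        threshold: stock level below which an item counts as low.
    """
    return inventory.count_documents({"stock": {"$lt": threshold}})

# The agent writes Python that calls the tool, instead of answering from memory.
agent = CodeAgent(tools=[count_low_stock_items], model=HfApiModel())
print(agent.run("How many items have fewer than 10 units in stock?"))
```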
Post
there's a new multimodal retrieval model in town
LlamaIndex released vdr-2b-multi-v1
> uses 70% fewer image tokens, yet outperforms other dse-qwen2-based models
> 3x faster inference with less VRAM
> shrinkable with matryoshka embeddings
> can do cross-lingual retrieval!
Collection: llamaindex/visual-document-retrieval-678151d19d2758f78ce910e1 (with models and datasets)
Demo: llamaindex/multimodal_vdr_demo
Learn more from their blog post here: https://huggingface.co/blog/vdr-2b-multilingual
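The matryoshka property means the embeddings stay usable when you keep only their leading dimensions and re-normalize. A minimal sketch of that shrinking step, using random stand-in vectors rather than real vdr-2b-multi-v1 outputs (see the model card for the actual encoding code):

```python
import numpy as np

def shrink(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Keep the first k dimensions and re-normalize to unit length."""
    truncated = embeddings[:, :k]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Stand-in for model outputs: 4 unit-norm vectors of an assumed 1536 dims.
full = np.random.randn(4, 1536).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = shrink(full, 512)            # 3x smaller vectors
print(small.shape)                   # (4, 512)
print((small @ small.T).round(2))    # cosine similarities remain usable
```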
Post
What a beginning to this year in open ML
Let's unwrap! merve/jan-10-releases-677fe34177759de0edfc9714
Multimodal
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released a multimodal textbook with 22k hours' worth of samples from instruction videos
> Dataset: SciCap, a benchmark dataset for captioning scientific documents, is released along with the challenge!
LLMs
> Microsoft released Phi-4, a sota open-source 14B language model
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using PRIME alignment
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of code instruction-code pairs
> Dataset: the Qwen team is on a roll, they just released CodeElo, a dataset of code preferences
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT large
> KaLM is a new family of performant multilingual embedding models with MIT license built using Qwen2-0.5B
Image/Video Generation
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models generating worlds from images, videos and text
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized version of OpenVid-1M with samples from OpenVid-1M
Others
> Prior Labs released TabPFNv2, a new state-of-the-art tabular transformer for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
Post
ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2 with MIT license
ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093
> The models are capable of tasks involving vision-language understanding and visual referring (referring segmentation) for both images and videos
> The models come in 1B, 4B and 8B sizes and are based on InternVL2.5 for the base architecture and Qwen2, Qwen2.5 and InternLM2 for the language model part (depending on the checkpoint)
> The model is very interesting: it has a different encoder for each modality (visual prompt, text prompt, image and video), then concatenates these to feed into the LLM. The output segmentation tokens are passed to SAM2 to match text (captions or semantic classes) to masks
> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and use different levels of descriptions to provide consistency.
Post
supercharge your LLM apps with smolagents
however cool your LLM is, without being agentic it can only go so far
enter smolagents: a new agent library by Hugging Face to make the LLM write code, do analysis and automate boring stuff!
Here's our blog for you to get started https://huggingface.co/blog/smolagents
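For a feel of the API, here is a minimal sketch of a code-writing agent with the library's built-in web search tool; the model choice is left to HfApiModel's default and the prompt is just an example.

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],   # tools the generated code is allowed to call
    model=HfApiModel(),               # hosted model via the HF Inference API
)

# The agent answers by writing and executing Python, not by replying in plain text.
print(agent.run("How many seconds are there in a leap year?"))
```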
Post
QwQ can see
Qwen team released QVQ, a large vision LM with reasoning
it outperforms proprietary VLMs on several benchmarks, comes with open weights and a demo!
Check them out:
Demo Qwen/QVQ-72B-preview
Model Qwen/QVQ-72B-Preview
Read more https://qwenlm.github.io/blog/qvq-72b-preview/
Congratulations @JustinLin610 and team!
Post
Google drops Gemini 2.0 Flash Thinking
a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with thoughts visible), can solve complex problems with Flash speeds, and more
now available in anychat, try it out: akhaliq/anychat
Post
Aya by Cohere For AI can now see!
The C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B that works in 8 languages!
The authors extend the LLaVA dataset to 558k examples using Aya's translation capabilities!
Try it here kkr5155/maya_demo
Dataset maya-multimodal/pretrain
Model maya-multimodal/maya
kudos @nahidalam and team
Post
Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B outperforms most 30B models
> the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 sizes with Apache 2.0 license, based on Qwen1.5 & Qwen2
> the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench
The paper has a lot of experiments (they trained 84 models!) about what makes video LMs work
Try the demo for the best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B
they evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions applied to small models also scale properly when the model and dataset are scaled up, though scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that fps sampling is better than uniform sampling, with 8-32 tokens per frame being optimal (see the sketch after this list)
> They also compare image encoders, trying a variety of models from shape-optimized SigLIP to DINOv2, and find google/siglip-so400m-patch14-384 to be the most powerful
> They also compare freezing different parts of the models; training all stages with some parts frozen gives the best yield
They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms 30B models
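To make the fps-sampling point concrete, here is a small self-contained sketch of sampling frame indices at a fixed rate in seconds rather than a fixed count per video; the rates and the cap are illustrative, not the paper's exact values.

```python
def fps_sample_indices(num_frames: int, video_fps: float, sample_fps: float = 2.0,
                       max_frames: int | None = None) -> list[int]:
    """Return frame indices sampled at `sample_fps` frames per second of video."""
    step = video_fps / sample_fps                  # spacing between kept frames
    indices = [int(i * step) for i in range(int(num_frames / step))]
    if max_frames is not None:                     # optionally cap the token budget
        indices = indices[:max_frames]
    return indices

# A 10 s clip at 30 fps sampled at 2 fps -> 20 frames, independent of the clip's frame rate.
print(len(fps_sample_indices(num_frames=300, video_fps=30.0)))
```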
Post
A complete RAG pipeline includes a reranker, which ranks the retrieved documents to find the best one
The same goes for multimodal RAG: there are multimodal rerankers we can integrate into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as the retriever, lightonai/MonoQwen2-VL-v0.1 as the reranker and Qwen/Qwen2-VL-7B-Instruct as the VLM in this notebook that runs on a GPU as small as an L4: https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
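As a taste of the first stage, here is a sketch of ColQwen2 retrieval scoring, assuming the colpali-engine package as shown on the model card; the page images and query are placeholders, and the reranking (MonoQwen2-VL) and answer-generation (Qwen2-VL) stages are covered in the notebook.

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

model_id = "vidore/colqwen2-v1.0"
model = ColQwen2.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = ColQwen2Processor.from_pretrained(model_id)

pages = [Image.open("page_1.png"), Image.open("page_2.png")]   # document page screenshots
queries = ["What was the Q3 revenue?"]

with torch.no_grad():
    page_emb = model(**processor.process_images(pages).to(model.device))
    query_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction (multi-vector) scores: one row per query, one column per page.
scores = processor.score_multi_vector(query_emb, page_emb)
print(scores)   # keep the top pages, rerank them, then feed the winners to the VLM
```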
Post
This week in open-source AI was insane! A small recap:
merve/dec-6-releases-67545caebe9fc4776faac0a3
Multimodal
> Google shipped PaliGemma 2, a new iteration of PaliGemma with more sizes (3B, 10B and 28B) and with pre-trained and captioning variants
> OpenGVLab released InternVL2.5: seven new vision LMs in different sizes, with a sota checkpoint under MIT license
> The Qwen team at Alibaba released the base models of Qwen2VL with 2B, 7B and 72B checkpoints
LLMs
> Meta released Llama 3.3-70B, a new iteration of Llama 70B trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU with 42 languages, with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! nearly 8TB of pretraining data in many languages!
Image/Video Generation
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux
Audio
> Indic-Parler-TTS is a new text2speech model made by the community
Post
New InternVL drop with a state-of-the-art 78B vision language model under MIT license: https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c
The release comes with seven new vision LMs in different sizes, based on InternViT 300M/6B and Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B)
The 78B model combines InternViT-6B and Qwen2.5-72B-Instruct and can accomplish a variety of tasks. Try it here: OpenGVLab/InternVL
Post
small but mighty
you can fine-tune SmolVLM on an L4 with a batch size of 4 and it will only take 16.4 GB VRAM; with gradient accumulation the simulated batch size is 16
I made a notebook that includes all the goodies: QLoRA, gradient accumulation and gradient checkpointing, with explanations of how they work: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
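For reference, a sketch of the key knobs rather than the full notebook: a 4-bit QLoRA base, LoRA adapters, gradient checkpointing, and batch size 4 with 4 accumulation steps to simulate a batch size of 16. The rank, target modules and output directory are illustrative placeholders.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(            # QLoRA: load the base model in 4-bit
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(                   # train small adapters, not the full model
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # illustrative; pick per architecture
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="smolvlm-ft",
    per_device_train_batch_size=4,          # fits in ~16 GB on an L4 per the post
    gradient_accumulation_steps=4,          # 4 x 4 = simulated batch size of 16
    gradient_checkpointing=True,            # trade compute for memory
    bf16=True,
    num_train_epochs=1,
)
```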
Post
Last week we were blessed with open-source models! A recap
merve/nov-29-releases-674ccc255a57baf97b1e2d31
Multimodal
> At Hugging Face we released SmolVLM, a performant and efficient smol vision language model
> Show Lab released ShowUI-2B: a new vision-language-action model to build GUI/web automation agents
> Rhymes AI has released the base models of Aria: Aria-Base-64K and Aria-Base-8K, named for their respective context lengths
> The ViDoRe team released ColSmolVLM: a new ColPali-like retrieval model based on SmolVLM
> Dataset: Llava-CoT-o1-Instruct: a new dataset labelled using the Llava-CoT multimodal reasoning model
> Dataset: LLaVA-CoT-100k, the dataset used to train Llava-CoT, released by the creators of Llava-CoT
LLMs
> The Qwen team released QwQ-32B-Preview, a state-of-the-art open-source reasoning model that broke the internet
> Alibaba has released Marco-o1, a new open-source reasoning model
> NVIDIA released Hymba 1.5B Base and Instruct, new state-of-the-art SLMs with a hybrid architecture (Mamba + transformer)
Image/Video Generation
> Qwen2VL-Flux: a new image generation model based on the Qwen2VL image encoder, T5 and Flux for generation
> Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 resolution
> Dataset: Image Preferences is a new image generation preference dataset made with Argilla's DIBT (Data Is Better Together) community effort
Audio
> OuteAI released OuteTTS-0.2-500M, a new multilingual text-to-speech model based on Qwen-2.5-0.5B trained on 5B audio prompt tokens
Post
QwQ-32B-Preview is now available in anychat
A reasoning model that is competitive with OpenAI o1-mini and o1-preview
try it out: akhaliq/anychat
Post
The authors of ColPali trained a retrieval model based on SmolVLM
vidore/colsmolvlm-alpha
TL;DR:
- ColSmolVLM performs better than ColPali and DSE-Qwen2 on all English tasks
- ColSmolVLM is more memory efficient than ColQwen2
Post
Small yet mighty!
We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient
We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39
Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning Recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
Also TRL integration for DPO
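A minimal inference sketch with transformers, assuming the Vision2Seq interface shown on the model card; the local image path and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")   # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image briefly."},
]}]

# Build the chat prompt, preprocess text + image together, then generate.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```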