Hey everyone ๐ค! Check out this new Virtual Try Off model (based on SD1.5): 1aurent/TryOffAnyone This model isn't as accurate as others (e.g. xiaozaa/cat-try-off-flux based on FLUX.1) but it sure is fast!
Coming back to Paris Friday to open our new Hugging Face office!
We're at capacity for the party but add your name in the waiting list as we're trying to privatize the passage du Caire for extra space for robots ๐ค๐ฆพ๐ฆฟ
The paper has a lot of experiments (they trained 84 models!) about what makes the video LMs work โฏ๏ธ
Try the demo for best setup here https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B they evaluate sampling strategies, scaling laws for models and datasets, video representation and more! > The authors find out that whatever design decision was applied to small models also scale properly when the model and dataset are scaled ๐ scaling dataset has diminishing returns for smaller models > They evaluate frame sampling strategies, and find that FPS sampling is better than uniform sampling, and they find 8-32 tokens per frame optimal > They also compare image encoders, they try a variation of models from shape optimized SigLIP to DINOv2 they find google/siglip-so400m-patch14-384 to be most powerful ๐ฅ > they also compare freezing different parts of models, training all stages with some frozen parts give the best yield
They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo 7B outperforms 30B models ๐ฅ
After some heated discussion ๐ฅ, we clarify our intent re. storage limits on the Hub
TL;DR: - public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible - private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community ๐ฅ
Multimodal ๐ผ๏ธ > Google shipped a PaliGemma 2, new iteration of PaliGemma with more sizes: 3B, 10B and 28B, with pre-trained and captioning variants ๐ > OpenGVLab released InternVL2, seven new vision LMs in different sizes, with sota checkpoint with MIT license โจ > Qwen team at Alibaba released the base models of Qwen2VL models with 2B, 7B and 72B ckpts
LLMs ๐ฌ > Meta released a new iteration of Llama 70B, Llama3.2-70B trained further > EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license ๐ฅ > Dataset: CohereForAI released GlobalMMLU, multilingual version of MMLU with 42 languages with Apache 2.0 license > Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models > Dataset: FineWeb2 just landed with multilinguality update! ๐ฅ nearly 8TB pretraining data in many languages!
Image/Video Generation ๐ผ๏ธ > Tencent released HunyuanVideo, a new photorealistic video generation model > OminiControl is a new editing/control framework for image generation models like Flux
Audio ๐ > Indic-Parler-TTS is a new text2speech model made by community
New InternVL drop with a state-of-the-art 78B vision language model with MIT license ๐ฅ https://huggingface.co/collections/OpenGVLab/internvl-25-673e1019b66e2218f68d7c1c The release comes with seven new vision LMs based on InternViT 300M/6B and Qwen2.5 (0.5B, 3B, 32B, 72B) and InternLM2 (8B, 7B, 20B) in different sizes 78B model is of InternViT 6B and Qwen2.5-72B Instruct, can accomplish variety of tasks ๐ Try here OpenGVLab/InternVL
Itโs 2nd of December , hereโs your Cyber Monday present ๐ !
Weโre cutting our price down on Hugging Face Inference Endpoints and Spaces!
Our folks at Google Cloud are treating us with a 40% price cut on GCP Nvidia A100 GPUs for the next 3๏ธโฃ months. We have other reductions on all instances ranging from 20 to 50%.
small but mighty ๐ฅ you can fine-tune SmolVLM on an L4 with batch size of 4 and it will only take 16.4 GB VRAM ๐ซฐ๐ป also with gradient accumulation simulated batch size is 16 โจ I made a notebook that includes all the goodies: QLoRA, gradient accumulation, gradient checkpointing with explanations on how they work ๐ https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):
- There will be the first major public protest related to AI - A big company will see its market cap divided by two or more because of AI - At least 100,000 personal AI robots will be pre-ordered - China will start to lead the AI race (as a consequence of leading the open-source AI race). - There will be big breakthroughs in AI for biology and chemistry. - We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.
How my predictions for 2024 turned out:
- A hyped AI company will go bankrupt or get acquired for a ridiculously low price โ (Inflexion, AdeptAI,...)
- Open-source LLMs will reach the level of the best closed-source LLMs โ with QwQ and dozens of others
- Big breakthroughs in AI for video, time-series, biology and chemistry โ for video ๐ดfor time-series, biology and chemistry
- We will talk much more about the cost (monetary and environmental) of AI โ Monetary ๐ดEnvironmental (๐ข)
- A popular media will be mostly AI-generated โ with NotebookLM by Google
- 10 millions AI builders on Hugging Face leading to no increase of unemployment ๐currently 7M of AI builders on Hugging Face
๐ผ๏ธ Multimodal > At Hugging Face we released SmolVLM, a performant and efficient smol vision language model ๐ > Show Lab released ShowUI-2B: new vision-language-action model to build GUI/web automation agents ๐ค > Rhymes AI has released the base model of Aria: Aria-Base-64K and Aria-Base-8K with their respective context length > ViDoRe team released ColSmolVLM: A new ColPali-like retrieval model based on SmolVLM > Dataset: Llava-CoT-o1-Instruct: new dataset labelled using Llava-CoT multimodal reasoning model๐ > Dataset: LLaVA-CoT-100k dataset used to train Llava-CoT released by creators of Llava-CoT ๐
๐ฌ LLMs > Qwen team released QwQ-32B-Preview, state-of-the-art open-source reasoning model, broke the internet ๐ฅ > AliBaba has released Marco-o1, a new open-source reasoning model ๐ฅ > NVIDIA released Hymba 1.5B Base and Instruct, the new state-of-the-art SLMs with hybrid architecture (Mamba + transformer)
โฏ๏ธ Image/Video Generation > Qwen2VL-Flux: new image generation model based on Qwen2VL image encoder, T5 and Flux for generation > Lightricks released LTX-Video, a new DiT-based video generation model that can generate 24 FPS videos at 768x512 res โฏ๏ธ > Dataset: Image Preferences is a new image generation preference dataset made with DIBT community effort of Argilla ๐ท๏ธ
Audio > OuteAI released OuteTTS-0.2-500M new multilingual text-to-speech model based on Qwen-2.5-0.5B trained on 5B audio prompt tokens
INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.
if you use Google Kubernetes Engine to host you ML workloads, I think this series of videos is a great way to kickstart your journey of deploying LLMs, in less than 10 minutes! Thank you @wietse-venema-demo !