I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!
* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting. * Verify correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine where there are lots of aliases) * Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc that one sees in o1 * Use the resulting data for SFT & RL * Use sparse rewards from GPT-4o to guide RL training. They find RL gives an average ~3 point boost across medical benchmarks and SFT on this data already gives a strong improvement.
Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute ๐ฅ
How? By combining step-wise reward models with tree search algorithms :)
We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"
We're open sourcing the full recipe and sharing a detailed blog post.
In our blog post we cover:
๐ Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.
๐ Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.
๐งญ Search and Learn: A lightweight toolkit for implementing search strategies with LLMs and built for speed with vLLM
After some heated discussion ๐ฅ, we clarify our intent re. storage limits on the Hub
TL;DR: - public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible - private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community ๐ฅ
INTELLECT-1 is the first collaboratively trained 10 billion parameter language model trained from scratch on 1 trillion tokens of English text and code.
Llava o1 - vlm capable of spontaneous, systematic reasoning, similar to GPT-o1, 11B model outperforms gemini-1.5-pro, gpt-4o-mini, and llama-3.2-90B-vision Xkev/Llama-3.2V-11B-cot
Jina AI Jina CLIP v2 - general purpose multilingual and multimodal (text & image) embedding model, 900M params, 512 x 512 resolution, matroyoshka representations (1024 to 64) jinaai/jina-clip-v2
Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents Nexusflow/athene-v2-6735b85e505981a794fb02cc
Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed microsoft/orca-agentinstruct-1M-v1
Smol TTS models are here! OuteTTS-0.1-350M - Zero shot voice cloning, built on LLaMa architecture, CC-BY license! ๐ฅ
> Pure language modeling approach to TTS > Zero-shot voice cloning > LLaMa architecture w/ Audio tokens (WavTokenizer) > BONUS: Works on-device w/ llama.cpp โก
Three-step approach to TTS:
> Audio tokenization using WavTokenizer (75 tok per second) > CTC forced alignment for word-to-audio token mapping > Structured prompt creation w/ transcription, duration, audio tokens
The model is extremely impressive for 350M parameters! Kudos to the OuteAI team on such a brilliant feat - I'd love to see this be applied on larger data and smarter backbones like SmolLM ๐ค
> Trained with 1.3 trillion (dolma 1.7) tokens on 16 nodes, each with 4 MI250 GPUs
> Three checkpoints:
- AMD OLMo 1B: Pre-trained model - AMD OLMo 1B SFT: Supervised fine-tuned on Tulu V2, OpenHermes-2.5, WebInstructSub, and Code-Feedback datasets - AMD OLMo 1B SFT DPO: Aligned with human preferences using Direct Preference Optimization (DPO) on UltraFeedback dataset
Key Insights: > Pre-trained with less than half the tokens of OLMo-1B > Post-training steps include two-phase SFT and DPO alignment > Data for SFT: - Phase 1: Tulu V2 - Phase 2: OpenHermes-2.5, WebInstructSub, and Code-Feedback
> Model checkpoints on the Hub & Integrated with Transformers โก๏ธ
Congratulations & kudos to AMD on a brilliant smol model release! ๐ค
What a great day for Open Science! @AIatMeta released models, datasets, and code for many of its research artefacts! ๐ฅ
1. Meta Segment Anything Model 2.1: An updated checkpoint with improved results on visually similar objects, small objects and occlusion handling. A new developer suite will be added to make it easier for developers to build with SAM 2.