Misc models:
- 🦖 T-Rex2, a very powerful object detection model for many applications: https://github.com/IDEA-Research/T-Rex
- 👀 CT-RATE, a 3D dataset paired with text reports: ibrahimhamamci/CT-RATE
- 🐙 Octopus v2, a Gemma-based model trained for Android API calls: extremely fast, better than Llama+RAG, great results: NexaAIDev/Octopus-v2
🌏 Models and datasets around the world
- Tess-70B, a MiQu-70B fine-tune with high-quality data: migtissera/Tess-70B-v1.6
- UNI, a model trained on 100 million pathology images from 100k+ slides: MahmoodLab/UNI
- CONCH, a VLM trained on 1.17 million pathology image-text pairs: MahmoodLab/CONCH
5. SpeechBrain 1.0: a toolkit with hundreds of recipes and pretrained models for audio tasks such as speech recognition, diarization, and enhancement. New major release!
HF repos: https://huggingface.co/speechbrain
Website: https://speechbrain.github.io/
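To give a flavor of the pretrained-model interface, here is a minimal sketch (my own illustration, not from the announcement): it pulls an ASR model from the SpeechBrain org on the Hub and transcribes a local file. The model id and audio file below are illustrative choices, and the import path has moved between releases, so check the docs for your version.

```python
# Minimal sketch: transcribe a file with a pretrained SpeechBrain ASR model.
# Note: in SpeechBrain 1.0 the inference classes live under speechbrain.inference;
# older releases expose them under speechbrain.pretrained.
from speechbrain.inference.ASR import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",  # any SpeechBrain ASR repo on the Hub
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))  # placeholder audio path
```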
The community has struggled to do a good preference-tune of Gemma, so the amazing @lewtun and @philschmid built an open-source recipe and trained a model to help people get started.
Some interesting details:
- Fine-tuned on DEITA and DPOed with the Argilla DPO dataset
- Very strong MT-Bench results (7.81), better than Zephyr Beta (Mistral-based) and Gemma Instruct
- Can run locally with tools such as llama.cpp on a Mac
- Not-so-good AGIEval results compared to Mistral-based tunes
- All training code is open-sourced
- Trained for 105 minutes on 8x H100
- No system message (see the sketch below)
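Because there is no system message, a chat prompt is just alternating user/assistant turns. A minimal transformers sketch of local usage (the repo id below is a placeholder I chose for illustration; substitute the released checkpoint):

```python
# Minimal sketch: chatting with the Gemma DPO fine-tune via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/gemma-dpo-finetune"  # placeholder: swap in the released repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# No system role: the conversation starts directly with a user turn.
messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```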
Big kudos to the team! Super exciting to see a good fine-tune for Gemma
The paper shows an adversarial attack strategy in which a user sends malicious queries that can affect the outputs of other users' queries in the same batch.
So if the same batch contains
- User A: benign query
- User B: malicious query
the response for A might be altered! 😱
How is this possible? One approach is to flood the expert token buffers with adversarial data, forcing the gating to route benign tokens to non-ideal experts or to drop them entirely (when the buffers have a finite capacity).
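To make the mechanism concrete, here is a toy sketch (my own illustration, not code from the paper) of capacity-limited top-1 routing: each expert accepts at most `capacity` tokens per batch, so if the attacker's tokens fill an expert's buffer first, later benign tokens routed to that expert are dropped.

```python
import torch

def route_top1_with_capacity(gate_logits: torch.Tensor, capacity: int) -> torch.Tensor:
    """Toy capacity-limited top-1 routing over a flattened batch of tokens.

    gate_logits: (num_tokens, num_experts) gating scores for the whole batch.
    Returns the expert assigned to each token, or -1 if that expert's buffer
    was already full (i.e. the token is dropped).
    """
    num_tokens, num_experts = gate_logits.shape
    preferred = gate_logits.argmax(dim=-1)              # each token's top expert
    load = torch.zeros(num_experts, dtype=torch.long)   # tokens accepted per expert
    assignment = torch.full((num_tokens,), -1, dtype=torch.long)
    for t in range(num_tokens):                         # tokens compete in batch order
        e = preferred[t].item()
        if load[e] < capacity:
            assignment[t] = e
            load[e] += 1
        # else: buffer full -> token dropped (other variants re-route instead)
    return assignment

# If the adversary's tokens come earlier in the batch and all target expert 0,
# they exhaust its capacity and the victim's tokens aimed at expert 0 get dropped.
torch.manual_seed(0)
adversarial = torch.zeros(8, 4); adversarial[:, 0] = 10.0  # all prefer expert 0
benign = torch.randn(4, 4); benign[:, 0] += 3.0            # victim also prefers expert 0
print(route_top1_with_capacity(torch.cat([adversarial, benign]), capacity=4))
```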
This assumes that the adversary can only use the model as a black box, but can observe the output logits and ensure their data is always grouped into the same batch as the victim's.
How to mitigate this?
- Randomize the batch order (and even run twice if some queries are very sensitive)
- Use a large capacity slack
- Sample from the gate weights instead of taking the top-k (not great IMO, as that requires more memory at inference; see the sketch below)
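For the last point, the idea is to make routing non-deterministic so an attacker cannot reliably predict or control which expert a victim's token lands on. A toy sketch (again my own illustration, using the same toy setup as above):

```python
import torch

def route_sampled(gate_logits: torch.Tensor) -> torch.Tensor:
    """Sample each token's expert from the softmax of its gate logits instead of
    always taking the argmax, so routing is no longer deterministic and
    batch-composition attacks become much less reliable."""
    probs = torch.softmax(gate_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

gate_logits = torch.randn(6, 4)
print(route_sampled(gate_logits))  # the same tokens can route differently on each run
```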
Current ranking of pre-trained (non-chat) open-access LLMs according to the leaderboard. Positions 1-4 are from China-based groups. Does training models on Chinese data somehow lead to better metrics? 🤔 WDYT?