Very impressive. Good world knowledge (SimpleQA of 25) despite high math/coding performance.

#27
by phil111 - opened

Qwen2.5 72b became unusable because of its huge drop in general knowledge and performance compared to Qwen2 72b (e.g. an English SimpleQA of only 9), so I was concerned DeepSeek would suffer the same drop in world knowledge after boosting math and coding performance. Yet it still achieved a SimpleQA of 25 (even GPT4o only scored ~40), and its Chinese SimpleQA is also high. Wish I had enough RAM to run this.

So is it safe to say that this is the best open weights LLM available at the moment?

@nlpguy I would be very surprised if DSv3 wasn't currently the best open weights LLM, although it's too large to run locally and I only spot-checked it on LMsys.

In English it appears to be comparable to Llama 3.1 405b, but not nearly as good as GPT4o.

My biggest criticism is that they over-fit the English MMLU.

What I mean by this is that they trained far more heavily on the small subset of popular English knowledge covered by the English MMLU (scoring roughly the same as GPT4o), yet the model has far less broad English knowledge (25 vs 38 on SimpleQA, which is a huge difference because the remaining questions get progressively harder).

And they didn't do the same with Chinese knowledge (65 vs GPT4o's 60 on the Chinese SimpleQA). But even with this over-fitting it still has broad English knowledge comparable to L3.1 405b's, plus more academic MMLU knowledge, so it still appears to be at least as good as L3.1 405b in English.
