How can we thank you enough, whale bros?

#1
by KrishnaKaasyap - opened

You guys, along with Qwen, Mistral, and Meta, are bearing the bulk of the weight of OSS like Atlas holding up the Earth!

A sub-100B model with long reasoning (like 100k tokens of thinking) and native tool use (text modality would be fine for the beginning) would be the greatest New Year's gift y'all could give the OSS community!

You guys seriously deserve a million GB200s.

Your work is much appreciated all over the world, including India. 💪🏼👏🏼

But they can't get the cards, because ... CHINA!

--

Despite having overall compute (GPU cards) an order of magnitude below that of peers in other regions, they still achieved top-tier results. This is where these guys are truly impressive. What the world needs is competition and progress, not blockades and confrontation.

So the technological blockade by the United States on China's AI field has failed, right?

definitely

@faceair:
You do realize that this 685B model has 256 experts, 8 of which are active, so only about 21B activated parameters across 8 experts, roughly 3B per expert. How expert can a 3B model be in its area of expertise? It is amazing that the coding performance is almost o1 level, but you do realize that all Chinese AI models are nothing but stolen synthetic data from OpenAI/Anthropic; they just shove it into a model able to hold that many tokens.

OpenAI, Grok, even later LLAMA 4/5 are going for 2 trillion parameters (dense or activated?) by next year. There is a huge difference when your model doesn't have free stolen synthetic data from others and you have to do the grunt work to actually think inside the model instead of being told.
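
To make the parameter arithmetic above concrete, here is a minimal top-k MoE routing sketch (an illustrative toy, not DeepSeek's actual code; the sizes are placeholders): a router scores every expert for each token and only the k highest-scoring expert FFNs run, which is why a model with a huge total parameter count can have a much smaller activated parameter count.

```python
# Toy top-k MoE routing for a single token (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 256, 8, 16               # 256 experts, 8 active; D is a toy hidden size

x = rng.standard_normal(D)                     # one token's hidden state
router_w = rng.standard_normal((N_EXPERTS, D)) # router: one score vector per expert

scores = router_w @ x                          # one scalar score per expert
top = np.argsort(scores)[-TOP_K:]              # indices of the 8 highest-scoring experts
gate = np.exp(scores[top] - scores[top].max())
gate /= gate.sum()                             # softmax over the selected experts only

# Only these 8 expert FFNs would actually run for this token; the other 248 are skipped,
# so the activated parameters are roughly TOP_K / N_EXPERTS of the expert weights.
print(sorted(top.tolist()), gate.round(3))
```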

Knowing OpenAI stole their data in the first place, then everyone else "stole" OpenAI's and each other's.

If you read the technical report, the MTP (multi-token prediction) only goes one sequential token deep and can even, in their words in Section 2.2, be entirely discarded. This feels like an exploration into extremely sparse MoE models with optional MTP/reasoning.
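
For anyone who hasn't read that section, the rough idea (my own toy sketch of a one-token-deep MTP head, not the paper's actual architecture or code) is an extra prediction head that is trained alongside the normal next-token head and can simply be dropped at inference:

```python
# Toy language model with an optional depth-1 MTP head (illustrative only).
import torch
import torch.nn as nn

class TinyLMWithMTP(nn.Module):
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.trunk = nn.GRU(d, d, batch_first=True)  # stand-in for the transformer trunk
        self.lm_head = nn.Linear(d, vocab)           # predicts token t+1 (always kept)
        self.mtp_head = nn.Sequential(               # extra head: predicts token t+2
            nn.Linear(d, d), nn.GELU(), nn.Linear(d, vocab)
        )

    def forward(self, tokens, use_mtp=True):
        h, _ = self.trunk(self.embed(tokens))
        logits_next = self.lm_head(h)
        logits_next2 = self.mtp_head(h) if use_mtp else None
        return logits_next, logits_next2

model = TinyLMWithMTP()
x = torch.randint(0, 1000, (2, 8))
train_out = model(x, use_mtp=True)    # both heads contribute to the training loss
infer_out = model(x, use_mtp=False)   # MTP head discarded: plain next-token decoding
```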

So the technological blockade by the United States on China's AI field has failed, right?

totally and utterly failed
I repeat: totally and utterly

@faceair:
You do realize that this 685B model has 256 experts, 8 of which are active, so only about 21B activated parameters across 8 experts, roughly 3B per expert. How expert can a 3B model be in its area of expertise? It is amazing that the coding performance is almost o1 level, but you do realize that all Chinese AI models are nothing but stolen synthetic data from OpenAI/Anthropic; they just shove it into a model able to hold that many tokens.

OpenAI, Grok, even later LLAMA 4/5 are going for 2 trillion parameters (dense or activated?) by next year. There is a huge difference when your model doesn't have free stolen synthetic data from others and you have to do the grunt work to actually think inside the model instead of being told.

go back to 4chan containment zone

Still, no matter how huge and smart these models are, they fail at easy questions like: name some countries whose names end with "lia".
Example answer:
Here are some countries whose names end with "lia":
Australia
Estonia
Albania
Mongolia
Somalia
Nigeria
Austria (though it’s pronounced "Austria," it technically ends with "lia")
North Macedonia (formerly known as "Macedonia," which ends with "lia")
Let me know if you'd like more examples or details!

So, some questions make a model look really dumb.

@urtuuuu This is an inherent limitation of the tokenizer. It was a trade-off: higher speeds, lower RAM usage, and lower training costs were prioritized over being able to distinguish letters individually.
Many modern tokenizers see "Australia" as a single token, so the LLM simply cannot tell whether that word ends with "lia". It "remembered" that people say Australia ends with "lia", but it didn't reach that conclusion on its own.
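
You can see this directly with a quick check (this assumes the tiktoken library and its cl100k_base vocabulary; other BPE tokenizers behave similarly):

```python
# Show how a BPE tokenizer splits words: common names map to whole tokens, not letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["Australia", " Australia", "Estonia", "Strawberry"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:>14} -> {len(ids)} token(s): {pieces}")
# Typically one or two tokens per word, never individual letters, so the model
# never "sees" the trailing l-i-a; it only knows an opaque token id.
```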

Because this kind of letter-wise analysis is very rarely of any real-world use, this was the right trade-off to make. The tokenizer makes LLMs feasible on consumer GPUs, reduces inference costs, makes training WAY more affordable, and allows for actually reasonable inference speeds.
There are only a few advantages to using a per-letter tokenizer:

  • Being able to do letter counting, "How many Rs are in 'Strawberry'" being the obvious example. No amount of reasoning will ever improve this.
  • Being able to interpret misspelled words better. Might be useful, but autocorrect exists, sooo yeaaa
  • Better interpretability of cryptic messages, perhaps extraction of patterns. Also not really that useful
  • The example you gave would work perfectly with a per-letter tokenizer, but besides that, there are no major benefits.
