Bartowski's picture

Bartowski PRO

bartowski

AI & ML interests

Official model curator for https://lmstudio.ai/

Recent Activity

updated a model about 5 hours ago
bartowski/L3-Pneuma-8B-GGUF
updated a model about 7 hours ago
bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF
updated a model about 7 hours ago
bartowski/Qwen2-VL-7B-Instruct-abliterated-GGUF
View all activity

Organizations

LM Studio's profile picture Arcee AI's profile picture Qwen's profile picture Crystal Care AI's profile picture Retis Labs's profile picture NeuroLattice's profile picture Cognitive Computations's profile picture Top Contributors: Model Downloads's profile picture LM Studio Community's profile picture private beta for deeplinks's profile picture Arcee Training Org's profile picture open/ acc's profile picture

bartowski's activity

replied to their post 9 days ago
view reply

No it does not include the XS, the reason Q4_0 and IQ4_NL work i think is because they don't do any clever packing of the scaling factors, that's why K quants and IQ4_XS (which is like NL but with some K quant logic) don't work yet

replied to their post 11 days ago
view reply

oh, yeah, of course.. I added all the ARM quants but then not Q4_0 which is now the only one that would work haha..

I'll go any make a Q4_0 for it I suppose ! just this once

replied to their post 11 days ago
view reply

Don't love adding more formats but if your results are accurate it does seem worth including

replied to their post 14 days ago
view reply

I've updated it to "Legacy format, offers online repacking for ARM and AVX CPU inference.", it is still overall legacy but with the online repacking is worth considering for speed

I'm hoping that IQ4_NL gets a few more packing options in the near future

replied to their post 15 days ago
view reply

hell yeah. wish we could still offline compile, i get why it's not sustainable in the future but also until there's better support and more options would be nice to keep it around

replied to their post 24 days ago
replied to julien-c's post 24 days ago
view reply

This makes perfect sense, average users definitely don't need to be uploading that much stuff privately, great for testing but if it's not worth releasing publicly it's not worth storing on servers for free :)

Great update !

reacted to julien-c's post with ๐Ÿ”ฅโค๏ธ 24 days ago
view post
Post
7923
After some heated discussion ๐Ÿ”ฅ, we clarify our intent re. storage limits on the Hub

TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)

docs: https://huggingface.co/docs/hub/storage-limits

We optimize our infrastructure continuously to scale our storage for the coming years of growth in Machine learning, to the benefit of the community ๐Ÿ”ฅ

cc: @reach-vb @pierric @victor and the HF team
ยท
posted an update 24 days ago
view post
Post
11758
Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method which is online (I prefer the term on-the-fly) repacking, so if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses I think due to using intrinsics instead of assembly, but intrinsics are more maintainable)

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download and Q4_0 quants and use those !

Also IQ4_NL supports repacking though not in as many shapes yet, but should get a respectable speed up on ARM chips, PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
ยท
replied to nyuuzyou's post about 1 month ago
posted an update about 1 month ago
view post
Post
14746
Old mixtral model quants may be broken!

Recently Slaren over on llama.cpp refactored the model loader - in a way that's super awesome and very powerful - but with it came breaking of support for "split tensor MoE models", which applies to older mixtral models

You may have seen my upload of one such older mixtral model, ondurbin/bagel-dpo-8x7b-v0.2, and with the newest changes it seems to be able to run without issue

If you happen to run into issues with any other old mixtral models, drop a link here and I'll try to remake them with the new changes so that we can continue enjoying them :)
  • 2 replies
ยท
reacted to merve's post with โค๏ธ about 1 month ago
view post
Post
3134
your hugging face profile now has your recent activities ๐Ÿค—
replied to their post 3 months ago
view reply

The test mark was after initial upload and after people pointed it out :) glad it is a good label though

posted an update 3 months ago
view post
Post
23795
In regards to the latest mistral model and GGUFs for it:

Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window

Yes, I got excited when a conversion worked and released them ASAP

That said, generation seems to work right now and seems to mimic the output from spaces that are running the original model

I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel mislead and that it's not the right thing to do, please post (civilly) below your thoughts, I will highly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all !
ยท
replied to their post 3 months ago
view reply

This argument really doesn't make any sense to me.. surely if you're aiming for the most accurate overall representation anyone can see that gathering as many data points across a diverse area would yield the most useful results? Sure ideally your single light will probably get a reasonably close overall value.. but also it might not?

Additionally, I think his point was that you don't necessarily want to increase performance against a given corpus, but rather increase faithfulness to the original model against a given corpus

You may be able to keep PPL the same or better than the original while simultaneously veering far from what the original model would have generated, which while great for that corpus of text, is not what the intention of the quantization itself is (in fact many people worry about this a lot, fearing that the quantization will favour the text used as a reference, which I'm luckily seeing is not what happens at least for imatrix)

The fact that 2 models can have identical PPL scores yet generate completely different text should be proof enough that PPL only tells a tiny part of a story. Yes it's good to know the model is good, but when quantizing I don't need to know how good it is, I need to know how similar it is to the original.

replied to their post 3 months ago
view reply

I suppose that's reasonable, I guess why I like KLD more is that I breaks it down into percentages, like mean, max, 99.99%, etc etc, where PPL is just a single all encompassing number that's more difficult to interpret

I don't know if I can put much value into IQ6 outperforming fp16 because lately we've been seeing benchmarks where Q3 beats bf16, so while useful I don't know that they can't definitively tell us quant quality, but I do think it's a good proof of competency

This is why KLD to me provides at least a slightly clearer image of how well the quantization does at recreating the original model. I see what you're saying still about PPL but (at least how llama.cpp does it) KLD gives a more thorough look. That and TOP p is nice to see how often the models agree on the token

replied to their post 3 months ago
view reply

That's not an invalid point, but also when the final goal is quantization that 0.03% is negligible compared to the rest of the losses.

If you're talking about running at full precision, yeah, bf16 > fp16 by all means

I'd also prefer to see KLD of fp16 vs bf16 since PPL is, to me, pretty meaningless. I'm sure it has value and probably more than I give it, but unless it's PPL against the dataset it was trained on I don't really find much merit to it.

I appreciate the breakdown though, and even 0.4% is not enough to worry me when again the final goal is quantization, not to run it at that DTYPE.

To that end, do you happen to know if when quantizing from BF16.. does it get converted to FP16 first? Does it even matter? BF16 -> Q8 vs BF16 -> FP16 -> Q8, I wonder how different it would be. Gut instinct says it's in the 0.01% range.

replied to their post 4 months ago
view reply

Bf16 can't be offloaded to GPUs so imatrix becomes slow to make :')

posted an update 4 months ago
view post
Post
34943
Reposting from twitter:

Just so you all know, I'll be on vacation for the following two weeks and away from home! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) feel free to shoot me a message if you see one I should make!

In the meantime if you need something bleeding edge make sure to check out @MaziyarPanahi or @bullerwins who both put out great work!
ยท