Quantum Entanglement and the Sentient Toaster: Revolutionizing LLM Training

#3
by mradermacher - opened

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

-rw------- 1 root root 509G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I assume that is in GB and not GiB. In which case 474 GiB might fit as we have 503 GiB of RAM (after subtracting RAM reserved for hardware) but would be extremely tight given the RAM required for context.
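For anyone following along, the unit conversion above can be checked with a few lines. This is only a back-of-the-envelope sketch: the 509 and 503 figures come from this thread, while the function and variable names are made up for illustration.

```python
GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte


def gb_to_gib(size_gb: float) -> float:
    """Convert a size in decimal GB to binary GiB."""
    return size_gb * GB / GIB


# Figures quoted in the thread: a 509 GB file and 503 GiB of usable RAM.
model_gib = gb_to_gib(509)
headroom_gib = 503 - model_gib

print(f"model: {model_gib:.1f} GiB, headroom: {headroom_gib:.1f} GiB")
```

That leaves a bit under 30 GiB for context and everything else on the machine, which is why it would be tight.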

I'm downloading the Q6_K for snowflake - remember, it often scores better at the correct_token metric than the source model :) But if you insist on the Q8_0 we can do that as well.

Q6_K is fine for me. Q8_0 might not fit without offloading and it is unclear if offloading is even possible. I don't think it's worth using RPC if Q6_K fits. As a bonus there will be enough RAM left to keep quantization tasks running if we do Q6_K. If you already have Q8_0 locally you should give it a try and see if it fits but if not Q6_K is fine for me.

I just checked and you do have it locally under /tmp/snowflake-arctic-instruct.Q8_0.gguf so please give it a try to see if it fits. I believe it should fit if nothing else is running as the model has such a small number of layers. If it doesn't fit use Q6_K instead.

474G Dec 7 13:01 snowflake-arctic-instruct.Q8_0.gguf

I'll try an offload of 1 and 0, then Q6. Hopefully it does not crash.

I think you have to finish or kill the frozen quantisation tasks first. They are using a lot of reserved RAM (not cached RAM that can be taken away).

So, despite it listing both GPUs, it only allocated something on GPU 0 (19GB). Otherwise, top says the process uses 435.6g, which is good, because I forgot to resume/stop the running quantize. I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

457.4g after warming up.

So, despite it listing both GPUs, it only allocated something on GPU0 (19GB)

llama.cpp uses both GPUs for imatrix but only offloaded to one because you set -ngl 1 and it can only offload on a per-layer basis. Also, since when are quantisation tasks using the GPUs?
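To illustrate the per-layer point: if a layer can only live on one device, then offloading a single layer can put weights on at most one GPU. The snippet below is a toy model of that constraint only. It is not llama.cpp's actual allocation code, and the function name and the even-split policy are assumptions made for the example.

```python
def split_offloaded_layers(n_gpu_layers: int, n_gpus: int) -> list[int]:
    """Toy model: spread whole layers over GPUs as evenly as possible.

    Layers are never split, so at most min(n_gpu_layers, n_gpus)
    devices end up holding any weights.
    """
    base, extra = divmod(n_gpu_layers, n_gpus)
    return [base + (1 if i < extra else 0) for i in range(n_gpus)]


# One offloaded layer on a two-GPU machine: only one GPU gets weights.
print(split_offloaded_layers(1, 2))  # [1, 0]
```

Which GPU actually receives the layer in a real run is not modelled here.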

grafik.png

I'd say we can even quantize, and if I manipulate the job a bit more, we might even do small imatrix calculations.

I'm not so sure about that. Keep in mind that imatrix uses mmap memory that can be taken away by other processes like quantisation tasks that use reserved memory.
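One rough way to watch for this on Linux is to look at how much of RAM is page cache, since mmap'd model pages are counted there and can be evicted under pressure. The sketch below just parses `/proc/meminfo`; the helper names are made up, and a shrinking `Cached` value is only a hint that model pages are being pushed out, not proof.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into a field -> value mapping (mostly kB)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def page_cache_share(fields: dict[str, int]) -> float:
    """Fraction of total RAM currently used as page cache."""
    return fields["Cached"] / fields["MemTotal"]


meminfo = Path("/proc/meminfo")
if meminfo.exists():  # only present on Linux
    info = parse_meminfo(meminfo.read_text())
    print(f"page cache: {page_cache_share(info):.0%} of RAM, "
          f"available: {info['MemAvailable'] / 2**20:.1f} GiB")
```

If the page cache share drops while imatrix is running and disk reads go up at the same time, the model is probably being re-read from SSD.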

grafik.png

dstat shows a relatively high disk read rate so imatrix might now be streaming from SSD:

grafik.png

Yes it is clearly streaming from SSD now:

grafik.png

Once the quantisation tasks are interrupted it should work without SSD streaming again.

I should mention that there is no feedback for this pause on the status screen. I'll probably change how that is reported, too.

All pause flags are shown in the status header now:

last updated: 2025-01-19 13:42:01+0100 (1s) (imatrix.GPU-188a5143-db69-7058-63b5-f2f1d2354f91)

echo pause GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
echo resume GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc >/dev/tcp/10.28.1.1/16713
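For scripting, the same thing can be done without bash's `/dev/tcp`. This is a minimal sketch that assumes the service accepts exactly what the `echo` lines above send, i.e. one text line of the form `<action> <GPU UUID>` followed by a newline; the function names are hypothetical and nothing about a reply from the server is assumed.

```python
import socket

VALID_ACTIONS = {"pause", "resume"}


def build_command(action: str, gpu_uuid: str) -> bytes:
    """Build the same line that `echo <action> <uuid>` would send."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return f"{action} {gpu_uuid}\n".encode("ascii")


def send_command(action: str, gpu_uuid: str,
                 host: str = "10.28.1.1", port: int = 16713,
                 timeout: float = 5.0) -> None:
    """Open a TCP connection, send one command line, and close."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_command(action, gpu_uuid))
```

Usage would be something like `send_command("pause", "GPU-2d319a51-0089-c21c-e3eb-6d8ecf9991cc")`, with the host and port taken from the commands above.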

Thanks a lot for implementing this so quickly. This is awesome as I can now use one of the RTX 4090 GPUs without pausing the entire imatrix queue.

All pause flags are shown in the status header

That's perfect.

Thanks a lot for implementing this so quickly. This is awesome as I can now use one of the RTX 4090 GPUs without pausing the entire imatrix queue.

It's indeed great for the future, but so far, that wasn't holding us back. What causes stress to the queue right now is the sheer amount of models and big models that have been released in the last two weeks, limiting even progress of the low-priority models. But we are getting there :)

@nicoboss Tell me that you paused a gpu on nico1, because I am confused and don't know if I did it and forgot to resume ;)

In other news, I'm finished queuing everything I'd ever wanted to queue from february to december last year. On to richard's list.

Tell me that you paused a gpu on nico1, because I am confused and don't know if I did it and forgot to resume ;)

I did pause the second GPU intentionally around an hour ago to give Guilherme34 the opportunity to test his new models. Guilherme34 needing some GPU resources today is the reason why I asked for the single GPU pause feature to be implemented and I’m really glad to have it. I would usually give him the RTX 3080 but I’m currently using it myself.

What causes stress to the queue right now is the sheer amount of models and big models that have been released in the last two weeks, limiting even progress of the low-priority models. But we are getting there :)

Having many new exciting great models is awesome so don't worry about them delaying our progress on the low-priority ones. We will eventually get to them. The model backlog has already shrunk massively compared to our peak of over 4000 models.

In other news, I'm finished queuing everything I'd ever wanted to queue from february to december last year. On to richard's list.

That's awesome to hear! We are making such great progress.

I did pause the second GPU intentionally

That's a relief :) I forgot about the timing and my command history from testing was a bit jumbled, so I really wasn't sure.

That's awesome to hear! We are making such great progress.

Yeah, and on to january and 2023 *g*
