I'm struggling making the Q2 version work on my M1 Max machine.
Is 24GB VRAM too low to make this work? I have a 32GB RAM machine.
It loads fine if I give it 40 GPU layers, but then it's very slow: ~2 tok/sec, with ~16 s to first token.
I-quants run very slowly on Metal, so that's probably why you're getting worse performance. I'll add the regular Q2_K version, which should run better (but without imatrix support).
That's unfortunate, and certainly news to me. Anywhere I can read up on why this is?
I'm not entirely sure why, but at the very least you can find the source here:
Use this for your 32GB Apple M1 Max machine:
sudo sysctl iogpu.wired_limit_mb=28672
Just keep in mind this lets the GPU wire up to ~88% of your RAM (unified memory).
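For reference, here's a quick sketch of where that 28672 value and the ~88% figure come from (assuming your machine's 32 GB of unified memory; the value itself is just 28 GiB expressed in MiB):

```shell
# Total unified memory on a 32 GB machine, in MiB
total_mb=$((32 * 1024))   # 32768

# The limit set above: 28 GiB in MiB
limit_mb=$((28 * 1024))   # 28672

# Fraction of RAM the GPU is allowed to wire, as an integer percent
echo $((100 * limit_mb / total_mb))   # prints 87 (i.e. ~88%)
```

If you want to go back to the default cap later, setting `iogpu.wired_limit_mb` to 0 restores macOS's default behavior, and the change doesn't persist across reboots anyway.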