Using CPU only
By default, GGML only uses the CPU. To enable GPU offloading, add the -ngl X
argument, where X is the number of layers to offload to the GPU.
With an A6000 you should have enough VRAM to offload all the layers of a 30B/33B model, so use -ngl 60.
You should see much better performance!
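As a concrete sketch, a full-offload llama.cpp invocation might look like the following; the model filename and prompt are placeholders, not files from this thread:

```shell
# Offload all 60 layers of a 30B/33B GGML model to the GPU with -ngl.
# The model path and prompt below are illustrative placeholders.
./main -m ./models/your-30b-model.ggmlv3.q4_0.bin -ngl 60 -p "Hello"
```

If you offload fewer layers than the model has (e.g. -ngl 30), the remainder stays on the CPU, which is useful when VRAM is tight.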
I should mention this in the README and will do soon.
I have a weird question or concern
In Ooba I can only get GGML models to run if I don't use arguments: I run them with the default loader, choose which model to load, and adjust the settings. Do I still have to use -ngl X? Or how do I enable that in the settings? Keep in mind I have already followed the steps to make GPU acceleration work.
https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md
However, it still does not work. I know other people are having this issue too. But if it's something simple, like needing to use -ngl X, I need to know.
No, there's no such thing as -ngl in text-gen-ui. In text-gen you use the layers slider in the UI.
I answered in your other discussion regarding the issues getting llama-cpp-python compiled with GPU support.