CPU
CPU support is obtained by installing two optional requirements files. This does not preclude GPU support; it just adds CPU support:
- Install the base, LangChain, GPT4All, and Python LLaMa dependencies:
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt # skip if already done for GPU support, since Windows needs the --extra-index-url line
pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all # for supporting unstructured package
pip install -r reqs_optional/requirements_optional_gpt4all.txt
See the GPT4All documentation for installation details if any issues are encountered.
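To confirm the optional CPU backends installed correctly, a quick import check can help. This is only a sketch; it assumes the usual module names llama_cpp and gpt4all for the llama-cpp-python and gpt4all packages:

```bash
# Sanity-check the CPU backends (module names assumed: llama_cpp, gpt4all).
python -c "import llama_cpp; print('llama-cpp-python OK')"
python -c "import gpt4all; print('gpt4all OK')"
```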
- Change the model name in .env_gpt4all if desired.
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin
For gptj and gpt4all_llama, you can choose a different model than our default choice by going to the GPT4All Model Explorer and picking a GPT4All-J compatible model. One does not need to download manually; the gpt4all package downloads the model at runtime and puts it into .cache, like Hugging Face would. However, the gptj model often gives no output, even outside h2oGPT.
So, for chatting, a better instruct-fine-tuned LLaMa-based model for llama.cpp can be downloaded from TheBloke, for example 13B WizardLM Quantized or 7B WizardLM Quantized. TheBloke offers a variety of model types, quantization bit depths, and memory footprints; choose what best fits your system's specs. However, be aware that LLaMa-based models are not commercially viable.
For the 7B case, download WizardLM-7B-uncensored.ggmlv3.q8_0.bin into a local path. Then set model_path_llama in .env_gpt4all, which is currently the default.
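As a concrete sketch of the 7B setup, assuming the quantized file is hosted in TheBloke's WizardLM-7B-uncensored-GGML repository on Hugging Face (adjust the URL to wherever you obtain the file):

```bash
# Download the quantized 7B model into the h2ogpt directory.
# The URL is an assumption; use whichever TheBloke GGML file you prefer.
wget https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin
# model_path_llama in .env_gpt4all already defaults to this file name,
# so no edit is needed unless you pick a different model.
```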
- Run generate.py
For LangChain support using documents in the user_path folder, run h2oGPT like the following (a sketch of preparing user_path appears after these commands):
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
See the LangChain Readme for more details. For no LangChain support (the LangChain package is still used as a model wrapper), run as:
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None
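For the UserData mode above, the user_path folder only needs to contain the documents to index before launch. A minimal sketch, with purely illustrative file names:

```bash
# Place a few documents where h2oGPT can find them (file names are examples only).
mkdir -p user_path
cp ~/Documents/report.pdf ~/Documents/notes.txt user_path/

# Index and chat over them on CPU (same command as above).
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None \
    --langchain_mode='UserData' --user_path=user_path
```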
When using llama.cpp based CPU models, for computers with low system RAM or slow CPUs, we recommend adding the following to .env_gpt4all:
use_mlock=False
n_ctx=1024
where use_mlock=True is the default (mlock is kept on to avoid slowness) and n_ctx=2048 is the default (for large context handling). For computers with plenty of system RAM, we recommend adding to .env_gpt4all:
n_batch=1024
for faster handling. On some systems this has no strong effect, but on others it may increase speed quite a bit.
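As a sketch, either set of tweaks can be appended to .env_gpt4all from the shell; pick the block that matches your system (the values are those suggested above):

```bash
# Low system RAM or slow CPU: disable mlock and shrink the context window.
cat >> .env_gpt4all <<'EOF'
use_mlock=False
n_ctx=1024
EOF

# Plenty of system RAM: raise the batch size for faster handling.
cat >> .env_gpt4all <<'EOF'
n_batch=1024
EOF
```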
Also, for slow and low-memory systems, we recommend using a smaller embedding model by passing the following to generate.py:
python generate.py ... --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2
where ... stands for any other options one would add, such as --base_model. This simpler embedding model is about half the size of the default instruct-large and so uses less disk, CPU memory, and GPU memory (if using GPUs).
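Putting the low-memory suggestions together, a full CPU invocation might look like the following sketch (only flags shown elsewhere in this section are used):

```bash
# Combined low-memory CPU run: llama.cpp model, no score model,
# and the smaller MiniLM embedding model.
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None \
    --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2
```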
See also Low Memory for more information about low-memory recommendations.