CPU
CPU support is obtained by installing two optional requirements files. This does not preclude GPU support; it just adds CPU support:
- Install the base, LangChain, GPT4All, and Python LLaMa dependencies:
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
pip install -r requirements.txt # skip if already done for GPU support, since Windows needs the --extra-index-url line
pip install -r reqs_optional/requirements_optional_langchain.txt
python -m nltk.downloader all # for supporting unstructured package
pip install -r reqs_optional/requirements_optional_gpt4all.txt
See the GPT4All documentation for installation details if any issues are encountered.
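To confirm the optional CPU backends installed correctly, a quick import check can help. This is only a sketch; it assumes the usual module names llama_cpp and gpt4all for the llama-cpp-python and gpt4all packages:

```bash
# Sanity-check the CPU backends (module names assumed: llama_cpp, gpt4all).
python -c "import llama_cpp; print('llama-cpp-python OK')"
python -c "import gpt4all; print('gpt4all OK')"
```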
- Change the model name in .env_gpt4all if desired.
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin
For gptj and gpt4all_llama, you can choose a different model than our default choice by going to the GPT4All Model Explorer and picking a GPT4All-J compatible model. One does not need to download manually; the gpt4all package downloads the model at runtime and puts it into .cache, like Hugging Face would. However, the gptj model often gives no output, even outside h2oGPT.
So, for chatting, a better instruct-fine-tuned LLaMa-based model for llama.cpp can be downloaded from TheBloke, for example 13B WizardLM Quantized or 7B WizardLM Quantized. TheBloke offers a variety of model types, quantization bit depths, and memory footprints; choose what best fits your system's specs. However, be aware that LLaMa-based models are not commercially viable.
For the 7B case, download WizardLM-7B-uncensored.ggmlv3.q8_0.bin into a local path. Then set model_path_llama in .env_gpt4all, which is currently the default.
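As a concrete sketch of the 7B setup, assuming the quantized file is hosted in TheBloke's WizardLM-7B-uncensored-GGML repository on Hugging Face (adjust the URL to wherever you obtain the file):

```bash
# Download the quantized 7B model into the h2ogpt directory.
# The URL is an assumption; use whichever TheBloke GGML file you prefer.
wget https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin
# model_path_llama in .env_gpt4all already defaults to this file name,
# so no edit is needed unless you pick a different model.
```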
- Run generate.py
For LangChain support using documents in the user_path folder, run h2oGPT like the following (a sketch of preparing user_path appears after these commands):
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path
See the LangChain Readme for more details. For no LangChain support (the LangChain package is still used as a model wrapper), run as:
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None
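For the UserData mode above, the user_path folder only needs to contain the documents to index before launch. A minimal sketch, with purely illustrative file names:

```bash
# Place a few documents where h2oGPT can find them (file names are examples only).
mkdir -p user_path
cp ~/Documents/report.pdf ~/Documents/notes.txt user_path/

# Index and chat over them on CPU (same command as above).
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None \
    --langchain_mode='UserData' --user_path=user_path
```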
When using llama.cpp based CPU models, for computers with low system RAM or slow CPUs, we recommend adding the following to .env_gpt4all:
use_mlock=False
n_ctx=1024
where use_mlock=True is the default (mlock is kept on to avoid slowness) and n_ctx=2048 is the default (for large context handling). For computers with plenty of system RAM, we recommend adding to .env_gpt4all:
n_batch=1024
for faster handling. On some systems this has no strong effect, but on others it may increase speed quite a bit.
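As a sketch, either set of tweaks can be appended to .env_gpt4all from the shell; pick the block that matches your system (the values are those suggested above):

```bash
# Low system RAM or slow CPU: disable mlock and shrink the context window.
cat >> .env_gpt4all <<'EOF'
use_mlock=False
n_ctx=1024
EOF

# Plenty of system RAM: raise the batch size for faster handling.
cat >> .env_gpt4all <<'EOF'
n_batch=1024
EOF
```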
Also, for slow and low-memory systems, we recommend using a smaller embedding model by passing the following to generate.py:
python generate.py ... --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2
where ... stands for any other options one would add, such as --base_model. This simpler embedding model is about half the size of the default instruct-large and so uses less disk, CPU memory, and GPU memory (if using GPUs).
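Putting the low-memory suggestions together, a full CPU invocation might look like the following sketch (only flags shown elsewhere in this section are used):

```bash
# Combined low-memory CPU run: llama.cpp model, no score model,
# and the smaller MiniLM embedding model.
python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None \
    --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2
```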
See also Low Memory for more information about low-memory recommendations.