|
### Containerized Installation for Inference on Linux GPU Servers |
|
|
|
1. Ensure Docker is installed and able to run NVIDIA containers (requires sudo). You can skip this step if your system can already run NVIDIA containers. The example below is for Ubuntu; see [NVIDIA Containers](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) for other distributions.
|
|
|
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit-base
sudo apt-get install -y nvidia-container-runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
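
Before building anything, it is worth confirming that the NVIDIA runtime is wired up correctly. A quick check is to run a throwaway CUDA container and make sure `nvidia-smi` can see your GPUs (the CUDA image tag below is only an example; pick one compatible with your driver):

```bash
# Sanity check: the NVIDIA runtime should expose the host GPUs inside the container
docker run --rm --runtime=nvidia nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```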
|
|
|
2. Build the container image: |
|
|
|
```bash
docker build -t h2ogpt .
```
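
The build can take a while. Once it finishes, you can confirm the image is available locally:

```bash
# List the freshly built image and its size
docker images h2ogpt
```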
|
|
|
3. Run the container (you can also use `finetune.py` and all of its parameters as shown above for training): |
|
|
|
For the fine-tuned h2oGPT with 20 billion parameters: |
|
```bash
docker run --runtime=nvidia --shm-size=64g -p 7860:7860 \
-v ${HOME}/.cache:/root/.cache --rm h2ogpt -it generate.py \
--base_model=h2oai/h2ogpt-oasst1-512-20b
```
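
On a multi-GPU server you may want to expose only some of the GPUs to the container. A minimal sketch, assuming you want GPU 0 only, is to set `NVIDIA_VISIBLE_DEVICES`, which the NVIDIA runtime honors:

```bash
# Same command as above, but restricted to GPU 0 via NVIDIA_VISIBLE_DEVICES
docker run --runtime=nvidia --shm-size=64g -p 7860:7860 \
-e NVIDIA_VISIBLE_DEVICES=0 \
-v ${HOME}/.cache:/root/.cache --rm h2ogpt -it generate.py \
--base_model=h2oai/h2ogpt-oasst1-512-20b
```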
|
|
|
If you have a private Hugging Face token, you can instead run:
|
```bash
docker run --runtime=nvidia --shm-size=64g --entrypoint=bash -p 7860:7860 \
-e HUGGINGFACE_API_TOKEN=<HUGGINGFACE_API_TOKEN> \
-v ${HOME}/.cache:/root/.cache --rm h2ogpt -it \
-c 'huggingface-cli login --token $HUGGINGFACE_API_TOKEN && python3.10 generate.py --base_model=h2oai/h2ogpt-oasst1-512-20b --use_auth_token=True'
```
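
To keep the token out of your shell history, you can instead pass through an environment variable that is already exported on the host; only the `-e` flag changes:

```bash
# Assumes HUGGINGFACE_API_TOKEN is already exported in the host shell
docker run --runtime=nvidia --shm-size=64g --entrypoint=bash -p 7860:7860 \
-e HUGGINGFACE_API_TOKEN="$HUGGINGFACE_API_TOKEN" \
-v ${HOME}/.cache:/root/.cache --rm h2ogpt -it \
-c 'huggingface-cli login --token $HUGGINGFACE_API_TOKEN && python3.10 generate.py --base_model=h2oai/h2ogpt-oasst1-512-20b --use_auth_token=True'
```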
|
|
|
For example, to run your own fine-tuned model built on the gpt-neox-20b foundation model:
|
```bash
docker run --runtime=nvidia --shm-size=64g -p 7860:7860 \
-v ${HOME}/.cache:/root/.cache --rm h2ogpt -it generate.py \
--base_model=EleutherAI/gpt-neox-20b \
--lora_weights=h2ogpt_lora_weights --prompt_type=human_bot
```
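
If the LoRA weights live on the host rather than inside the image, mount them into the container and point `--lora_weights` at the mounted path. This is only a sketch: `${HOME}/h2ogpt_lora_weights` and the `/lora` mount point are placeholders, so adjust both to your setup:

```bash
# Host path and container mount point below are placeholders; adjust to where your weights actually live
docker run --runtime=nvidia --shm-size=64g -p 7860:7860 \
-v ${HOME}/.cache:/root/.cache \
-v ${HOME}/h2ogpt_lora_weights:/lora/h2ogpt_lora_weights --rm h2ogpt -it generate.py \
--base_model=EleutherAI/gpt-neox-20b \
--lora_weights=/lora/h2ogpt_lora_weights --prompt_type=human_bot
```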
|
|
|
4. Open `http://localhost:7860` in the browser.
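
If the container runs on a remote GPU server, one common way to reach the UI from your workstation is an SSH tunnel (`user@gpu-server` is a placeholder), after which the same URL works locally:

```bash
# Forward local port 7860 to port 7860 on the server; user@gpu-server is a placeholder
ssh -L 7860:localhost:7860 user@gpu-server
```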
|
|
|
### Docker Compose Setup & Inference |
|
|
|
1. (Optional) Change the desired model and weights under `environment` in `docker-compose.yml`.
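
The exact variable names are defined in the repository's `docker-compose.yml`, so check there rather than guessing; a quick way to see what can be changed is:

```bash
# Show the environment section of the compose file (adjust the number of context lines as needed)
grep -A 5 'environment' docker-compose.yml
```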
|
|
|
2. Build and run the container:
|
|
|
```bash
docker-compose up -d --build
```
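
Once the stack is up, you can verify the service is running and see its port mappings:

```bash
# Show container status and port mappings for the compose stack
docker-compose ps
```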
|
|
|
3. Open `http://localhost:7860` in the browser.
|
|
|
4. See logs: |
|
|
|
```bash
docker-compose logs -f
```
|
|
|
5. Clean everything up: |
|
|
|
```bash
docker-compose down --volumes --rmi all
```
|
|
|
|