Triton Inference Server
To get optimal inference performance for h2oGPT models, we use the FasterTransformer backend for Triton.
Make sure to install Docker first.
Build Docker image for Triton with FasterTransformer backend:
git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
git clone https://github.com/NVIDIA/FasterTransformer.git
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
docker build --rm \
--build-arg TRITON_VERSION=${CONTAINER_VERSION} \
-t ${TRITON_DOCKER_IMAGE} \
-f docker/Dockerfile \
.
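The image build can take a while, so it may be worth checking whether the tag already exists locally before rebuilding. A minimal sketch, assuming the docker CLI is on the PATH; the helper name image_exists is hypothetical and not part of either repository:

```shell
# Hypothetical helper: succeed only if a local Docker image with this tag exists.
# Relies on "docker image inspect", which exits non-zero for an unknown image.
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Example use with the tag defined above:
# image_exists "${TRITON_DOCKER_IMAGE}" && echo "image already built"
```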
Create model definition files
We convert the h2oGPT model from Hugging Face (HF) format to FasterTransformer (FT) format:
Fetch model from Hugging Face
export MODEL=h2ogpt-oig-oasst1-512-6_9b
if [ ! -d ${MODEL} ]; then
git lfs clone https://huggingface.co/h2oai/${MODEL}
fi
If git lfs fails, make sure to install it first. For Ubuntu:
sudo apt-get install git-lfs
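The git lfs subcommand is provided by a separate git-lfs executable, so a quick check that it is on the PATH can save a failed clone. A minimal sketch; the helper name require_cmd is hypothetical:

```shell
# Hypothetical helper: succeed only if the named command is on the PATH.
require_cmd() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "missing required command: $1" >&2
        return 1
    fi
}

# Example: check the tools used in this guide before starting.
# require_cmd git && require_cmd git-lfs && require_cmd docker
```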
Convert to FasterTransformer format
export WORKSPACE=$(pwd)
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}
# Start an interactive shell inside the Triton container
docker run -it --rm --runtime=nvidia --shm-size=1g \
--ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} \
-e CUDA_VISIBLE_DEVICES=0 \
-e MODEL=${MODEL} \
-e WORKSPACE=${WORKSPACE} \
-w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
# Inside the container: make the FasterTransformer sources importable
export PYTHONPATH=${WORKSPACE}/FasterTransformer/:$PYTHONPATH
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/utils/huggingface_gptneox_convert.py \
-i_g 1 \
-m_n gptneox \
-i ${WORKSPACE}/${MODEL} \
-o ${WORKSPACE}/FT-${MODEL}
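After the conversion finishes, a quick look at the output directory can catch an incomplete run before Triton tries to load it. The sketch below assumes the converter writes a config.ini and .bin weight files into the 1-gpu subdirectory; that layout is an assumption to verify against your own output, and check_ft_checkpoint is a hypothetical helper name:

```shell
# Sketch: report whether a converted checkpoint directory looks complete.
# Assumption: the converter writes config.ini plus *.bin weight files there.
check_ft_checkpoint() {
    dir="$1"
    if [ ! -f "${dir}/config.ini" ]; then
        echo "no config.ini in ${dir}" >&2
        return 1
    fi
    # Count weight files directly inside the directory.
    count=$(find "${dir}" -maxdepth 1 -name '*.bin' | wc -l | tr -d ' ')
    if [ "${count}" -eq 0 ]; then
        echo "no .bin weight files in ${dir}" >&2
        return 1
    fi
    echo "found config.ini and ${count} weight files in ${dir}"
}
```

For the paths used in this guide, the call would be check_ft_checkpoint ${WORKSPACE}/FT-${MODEL}/1-gpu.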
Test the FasterTransformer model
FIXME
echo "Hi, who are you?" > gptneox_input
echo "And you are?" >> gptneox_input
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptneox/gptneox_example.py \
--ckpt_path ${WORKSPACE}/FT-${MODEL}/1-gpu \
--tokenizer_path ${WORKSPACE}/${MODEL} \
--sample_input_file gptneox_input
Update Triton configuration files
Fix a typo in the example preprocessing configuration, which refers to postprocessing:
sed -i -e 's@postprocessing@preprocessing@' all_models/gptneox/preprocessing/config.pbtxt
Update the path to the converted FasterTransformer model, and set it to use 1 GPU:
sed -i -e "s@/workspace/ft/models/ft/gptneox/@${WORKSPACE}/FT-${MODEL}/1-gpu@" all_models/gptneox/fastertransformer/config.pbtxt
sed -i -e 's@string_value: "2"@string_value: "1"@' all_models/gptneox/fastertransformer/config.pbtxt
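The two substitutions above can be tried on a scratch file first. The fragment below is an illustrative stand-in for config.pbtxt, not the real file, and /data/FT-model/1-gpu is a placeholder for ${WORKSPACE}/FT-${MODEL}/1-gpu. Note that the second expression rewrites every string_value: "2" in the file, not only the GPU count, so it is worth checking which lines changed:

```shell
# Illustrative fragment only; the real config.pbtxt contains many more fields.
tmp_cfg=$(mktemp)
cat > "${tmp_cfg}" <<'EOF'
parameters {
  key: "tensor_para_size"
  value: { string_value: "2" }
}
parameters {
  key: "model_checkpoint_path"
  value: { string_value: "/workspace/ft/models/ft/gptneox/" }
}
EOF

# Placeholder path standing in for ${WORKSPACE}/FT-${MODEL}/1-gpu.
new_path=/data/FT-model/1-gpu

# Same substitutions as above, applied to the scratch file (GNU sed syntax).
sed -i -e "s@/workspace/ft/models/ft/gptneox/@${new_path}@" "${tmp_cfg}"
sed -i -e 's@string_value: "2"@string_value: "1"@' "${tmp_cfg}"

# Show the lines that now carry the new values.
grep -n 'string_value' "${tmp_cfg}"
```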
Launch Triton
CUDA_VISIBLE_DEVICES=0 mpirun -n 1 \
--allow-run-as-root /opt/tritonserver/bin/tritonserver \
--model-repository=${WORKSPACE}/all_models/gptneox/ &
Now, you should see something like this:
+-------------------+---------+--------+
| Model | Version | Status |
+-------------------+---------+--------+
| ensemble | 1 | READY |
| fastertransformer | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
+-------------------+---------+--------+
which means the pipeline is ready to make predictions!
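Instead of watching the log for this table, a script can poll Triton's HTTP readiness endpoint (v2/health/ready). A minimal sketch, assuming the default HTTP port 8000 and that curl is available in the container; wait_for_triton is a hypothetical helper name:

```shell
# Sketch: poll Triton's HTTP readiness endpoint until it answers or we give up.
# Assumptions: default HTTP port 8000, curl available inside the container.
wait_for_triton() {
    url="${1:-http://localhost:8000/v2/health/ready}"
    tries="${2:-60}"
    while [ "${tries}" -gt 0 ]; do
        # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
        if curl -sf -o /dev/null "${url}"; then
            echo "Triton is ready"
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "Triton did not become ready at ${url}" >&2
    return 1
}
```

Calling wait_for_triton with no arguments waits up to roughly 60 seconds; large models may need a higher try count as the second argument.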
Run client test
Let's test the endpoint:
python3 ${WORKSPACE}/tools/gpt/identity_test.py
And now the end-to-end test. We first have to fix a bug in the inputs for postprocessing:
sed -i -e 's@prepare_tensor("RESPONSE_INPUT_LENGTHS", output2, FLAGS.protocol)@prepare_tensor("sequence_length", output1, FLAGS.protocol)@' ${WORKSPACE}/tools/gpt/end_to_end_test.py
python3 ${WORKSPACE}/tools/gpt/end_to_end_test.py