You are viewing main version, which requires installation from source. If you'd like
regular pip install, checkout the latest stable version (v0.7.0).
Use VLLM as backend
Lighteval allows you to use vllm
as backend allowing great speedups.
To use, simply change the model_args
to reflect the arguments you want to pass to vllm.
lighteval vllm \
"pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
"leaderboard|truthfulqa:mc|0|0"
vllm
is able to distribute the model across multiple GPUs using data
parallelism, pipeline parallelism or tensor parallelism.
You can choose the parallelism method by setting in the the model_args
.
For example if you have 4 GPUs you can split it across using tensor_parallelism
:
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
"pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
"leaderboard|truthfulqa:mc|0|0"
Or, if your model fits on a single GPU, you can use data_parallelism
to speed up the evaluation:
lighteval vllm \
"pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
"leaderboard|truthfulqa:mc|0|0"
Available arguments for vllm
can be found in the VLLMModelConfig
:
- pretrained (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
- gpu_memory_utilisation (float): The fraction of GPU memory to use.
- revision (str): The revision of the model.
- dtype (str, None): The data type to use for the model.
- tensor_parallel_size (int): The number of tensor parallel units to use.
- data_parallel_size (int): The number of data parallel units to use.
- max_model_length (int): The maximum length of the model.
- swap_space (int): The CPU swap space size (GiB) per GPU.
- seed (int): The seed to use for the model.
- trust_remote_code (bool): Whether to trust remote code during model loading.
- add_special_tokens (bool): Whether to add special tokens to the input sequences.
- multichoice_continuations_start_space (bool): Whether to add a space at the start of each continuation in multichoice generation.
In the case of OOM issues, you might need to reduce the context size of the
model as well as reduce the gpu_memory_utilisation
parameter.