Lighteval documentation

Use VLLM as backend

You are viewing main version, which requires installation from source. If you'd like regular pip install, checkout the latest stable version (v0.7.0).
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Use VLLM as backend

Lighteval allows you to use vllm as backend allowing great speedups. To use, simply change the model_args to reflect the arguments you want to pass to vllm.

lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"

vllm is able to distribute the model across multiple GPUs using data parallelism, pipeline parallelism or tensor parallelism. You can choose the parallelism method by setting in the the model_args.

For example if you have 4 GPUs you can split it across using tensor_parallelism:

export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Or, if your model fits on a single GPU, you can use data_parallelism to speed up the evaluation:

lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"

Available arguments for vllm can be found in the VLLMModelConfig:

  • pretrained (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
  • gpu_memory_utilisation (float): The fraction of GPU memory to use.
  • revision (str): The revision of the model.
  • dtype (str, None): The data type to use for the model.
  • tensor_parallel_size (int): The number of tensor parallel units to use.
  • data_parallel_size (int): The number of data parallel units to use.
  • max_model_length (int): The maximum length of the model.
  • swap_space (int): The CPU swap space size (GiB) per GPU.
  • seed (int): The seed to use for the model.
  • trust_remote_code (bool): Whether to trust remote code during model loading.
  • add_special_tokens (bool): Whether to add special tokens to the input sequences.
  • multichoice_continuations_start_space (bool): Whether to add a space at the start of each continuation in multichoice generation.

In the case of OOM issues, you might need to reduce the context size of the model as well as reduce the gpu_memory_utilisation parameter.

< > Update on GitHub