Multinode operation
What an incredible model with enormous potential applications in my work as an historian! However, my limited technical background seems to be holding me back from running this effectively in my environment.
The server command at https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411 appears to assume a single-node setup, but I need to run this across multiple nodes (specifically, two or maybe three nodes, each with 4 NVIDIA A100 GPUs, providing 160GB of GPU memory per node). Despite my efforts, I’ve been struggling to get it working properly in this multi-node configuration.
I’ve successfully converted the Docker image into an Apptainer .sif file and downloaded the model to a local directory. However, I consistently run into errors, usually after all shards are loaded. Two examples: an out-of-memory error ("tried to allocate 2.93 GiB. GPU 0 has a total capacity of 39.50 GiB of which 106.12 MiB is free.") and a shutdown warning ("/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown"). Has anyone successfully deployed this model in a multi-node environment and might be able to offer some guidance? Any advice would be greatly appreciated!
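For what it's worth, here is the back-of-the-envelope arithmetic I have been using to reason about the out-of-memory error. It is only a rough sketch: it assumes about 124 billion parameters stored at 2 bytes each (bf16), assumes the weights are split evenly across all GPUs, and ignores the KV cache, activations and CUDA context entirely. The function name `weights_gib_per_gpu` is my own invention.

```shell
#!/bin/bash
# Rough per-GPU weight footprint in GiB (integer, rounded down).
# Args: parameters in billions, bytes per parameter, total number of GPUs.
# Assumption: weights are split evenly; KV cache and activations are ignored.
weights_gib_per_gpu() {
    local params_b=$1 bytes_per_param=$2 num_gpus=$3
    echo $(( params_b * 1000000000 * bytes_per_param / num_gpus / 1024 / 1024 / 1024 ))
}

echo "2 nodes (8 GPUs):  $(weights_gib_per_gpu 124 2 8) GiB per GPU"
echo "3 nodes (12 GPUs): $(weights_gib_per_gpu 124 2 12) GiB per GPU"
```

Under those assumptions this gives about 28 GiB per GPU on two nodes and about 19 GiB per GPU on three, against the 39.50 GiB reported in the error. So on paper the weights alone should fit, which matches what I see: the failure usually comes after the shards have loaded.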
Current attempt here....
#!/bin/bash -l
#SBATCH -A
#SBATCH -q default
#SBATCH -p gpu
#SBATCH --time=01:00:00
#SBATCH -N 3
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --gpus-per-task=4
#SBATCH --error="debug_local_model-%j.err"
#SBATCH --output="debug_local_model-%j.out"
# setup
module --force purge
module load env/release/2023.1
module load Apptainer/1.3.1-GCCcore-12.3.0
# pmix
export PMIX_MCA_psec=native
# Local model
export LOCAL_MODEL_DIR="/workspace/models--mistralai--Pixtral-Large-Instruct-2411/snapshots/6aea62fd4e842bb7981339519accf19c7120ccd3"
# SIF image
export SIF_IMAGE="vllm-openai.sif"
# Apptainer bind
export APPTAINER_ARGS="--nvccli -B /mnt/tier2/project/:/workspace"
# Ray
export HEAD_HOSTNAME="$(hostname)"
export HEAD_IPADDRESS="$(hostname --ip-address)"
export RANDOM_PORT=$(python3 -c 'import socket; s=socket.socket(); s.bind(("",0)); print(s.getsockname()[1]); s.close()')
export RAY_CMD_HEAD="ray start --block --head --port=${RANDOM_PORT} --num-cpus=${SLURM_CPUS_PER_TASK} --verbose"
export RAY_CMD_WORKER="ray start --block --address=${HEAD_IPADDRESS}:${RANDOM_PORT} --num-cpus=${SLURM_CPUS_PER_TASK} --verbose"
export TENSOR_PARALLEL_SIZE=4
export PIPELINE_PARALLEL_SIZE=${SLURM_NNODES}
# end of setup
# Logging to try and identify problems
echo "========== ENVIRONMENT VARIABLES =========="
env
echo "========== SLURM VARs =========="
echo "SLURM_JOBID: ${SLURM_JOBID}"
echo "SLURM_NNODES: ${SLURM_NNODES}"
echo "SLURM_NODELIST: ${SLURM_NODELIST}"
echo "SLURM_CPUS_PER_TASK: ${SLURM_CPUS_PER_TASK}"
echo "SLURM_GPUS_PER_TASK: ${SLURM_GPUS_PER_TASK}"
echo "========== NODE & GPU INFORMATION (HOST) =========="
srun -N ${SLURM_NNODES} -l hostname
srun -N ${SLURM_NNODES} -l nvidia-smi -L
echo "========== NODE & GPU INFORMATION (INSIDE APPTAINER) =========="
srun -N ${SLURM_NNODES} -l apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} nvidia-smi
echo "========== CHECKING MODEL DIRECTORY ACCESS INSIDE APPTAINER =========="
srun -N ${SLURM_NNODES} -l apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ls -l ${LOCAL_MODEL_DIR}
# Additional check: print disk usage to confirm full model presence
srun -N 1 -l apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} du -sh ${LOCAL_MODEL_DIR}
# START RAY
echo "========== STARTING RAY HEAD NODE =========="
srun -J "head_ray_node_step_%J" -N 1 --ntasks-per-node=1 -c $(( SLURM_CPUS_PER_TASK/2 )) -w ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_HEAD} &
sleep 20
echo "========== STARTING RAY WORKERS =========="
srun -J "worker_ray_node_step_%J" -N $(( SLURM_NNODES-1 )) --ntasks-per-node=1 -c ${SLURM_CPUS_PER_TASK} -x ${HEAD_HOSTNAME} apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 30
# TEST VLLM
echo "HEAD NODE: ${HEAD_HOSTNAME}"
echo "IP ADDRESS: ${HEAD_IPADDRESS}"
echo "RANDOM PORT (RAY): ${RANDOM_PORT}"
echo "SSH TUNNEL CMD: ssh -p 8822 ${USER}@login.lxp.lu -NL 8000:${HEAD_IPADDRESS}:8000"
echo "========== TESTING VLLM SERVE COMMAND LOCALLY =========="
# Attempt to run vllm serve against the local model directory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} vllm serve ${LOCAL_MODEL_DIR} \
    --config-format mistral \
    --load-format mistral \
    --tokenizer_mode mistral \
    --limit_mm_per_prompt 'image=10' \
    --max-model-len 4096 \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
    --pipeline-parallel-size ${PIPELINE_PARALLEL_SIZE} \
    --port 8000 \
    --host 0.0.0.0
echo "========== SCRIPT COMPLETE =========="
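Two small pre-flight helpers I am considering adding, in case they help anyone spot what I am doing wrong. This is only a sketch: `check_world_size` and `wait_until` are hypothetical names of my own, the first assumes I want the tensor-parallel size times the pipeline-parallel size to equal the total number of GPUs requested, and the second is a generic retry loop meant to replace the fixed `sleep 20` and `sleep 30` calls.

```shell
#!/bin/bash
# Succeed if tensor-parallel x pipeline-parallel equals the total GPU count.
# Args: tensor-parallel size, pipeline-parallel size, nodes, GPUs per node.
check_world_size() {
    local tp=$1 pp=$2 nodes=$3 gpus_per_node=$4
    [ $(( tp * pp )) -eq $(( nodes * gpus_per_node )) ]
}

# Retry a command until it succeeds or the attempts run out.
# Args: max attempts, seconds between attempts, then the command to run.
wait_until() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for (( i = 1; i <= attempts; i++ )); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example with the values from my script (3 nodes, 4 GPUs each, TP=4, PP=3).
if check_world_size 4 3 3 4; then
    echo "parallel sizes match the GPU count"
else
    echo "parallel sizes do not match the GPU count" >&2
fi
```

In the job script, something like `wait_until 30 2 apptainer exec ${APPTAINER_ARGS} ${SIF_IMAGE} ray status --address=${HEAD_IPADDRESS}:${RANDOM_PORT}` could stand in for the first sleep. That assumes `ray status` exits with a non-zero code until the head node is reachable, which I have not verified on this cluster.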