Running the sample code fails with a CUDA error
```
cheng@cheng:~$ python qwq.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 17/17 [00:10<00:00, 1.61it/s]
Traceback (most recent call last):
File "/home/cheng/qwq.py", line 24, in
generated_ids = model.generate(
^^^^^^^^^^^^^^^
File "/home/cheng/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/cheng/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/generation/utils.py", line 2252, in generate
result = self._sample(
^^^^^^^^^^^^^
File "/home/cheng/anaconda3/envs/vllm/lib/python3.11/site-packages/transformers/generation/utils.py", line 3297, in _sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
```
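The assert fires inside `torch.multinomial` when the sampling probabilities contain `inf` or `nan`, which usually points at corrupted weights or a dtype problem rather than at the generation call itself. For context, here is a minimal sketch of the kind of sample code that produces the trace above (assumed to follow the Qwen/QwQ-32B-Preview model card; the actual contents of qwq.py are not shown in this issue):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B-Preview"  # assumption: the QwQ preview checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # corrupted shards or a dtype mismatch here can yield nan logits
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# The device-side assert is raised from torch.multinomial during sampling;
# rerunning with CUDA_LAUNCH_BLOCKING=1 makes the reported stack trace
# point at the actual failing kernel instead of a later API call.
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```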
I've re-downloaded the HF model and updated vLLM to the latest version, 0.6.5, and it's working perfectly now. Thank you!
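For anyone landing here with the same assert: after re-downloading the weights and upgrading, a quick sanity check under vLLM looks roughly like this (a hedged sketch; the model id and sampling settings are assumptions, not taken from this thread):

```python
from vllm import LLM, SamplingParams

# Assumed model id; substitute whichever checkpoint you re-downloaded.
llm = LLM(model="Qwen/QwQ-32B-Preview")
params = SamplingParams(temperature=0.7, max_tokens=256)

# If the weights are intact, this completes without the multinomial assert.
outputs = llm.generate(["How many r's are in 'strawberry'?"], params)
print(outputs[0].outputs[0].text)
```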
Although this model is experimental, it currently offers the best experience among open-source models, even surpassing some closed-source options. Thanks to Alibaba Cloud for open-sourcing it!