codellama-13b-awq RuntimeError: CUDA error: an illegal memory access was encountered
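When I run inference with CodeLlama-13B-AWQ using the method above, vLLM fails during engine initialization with a CUDA illegal memory access in the AWQ GEMM kernel (`awq_gemm`). Why does this happen? The full log is below.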
WARNING: WatchFiles detected changes in 'fastapi_vllm_codellama.py'. Reloading...
INFO 10-31 16:58:55 llm_engine.py:72] Initializing an LLM engine with config: model='./CodeLlama-13B-AWQ', tokenizer='./CodeLlama-13B-AWQ', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
INFO 10-31 16:58:55 tokenizer.py:30] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Process SpawnProcess-46:
Traceback (most recent call last):
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/_subprocess.py", line 76, in subprocess_started
target(sockets=sockets)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 61, in run
return asyncio.run(self.serve(sockets=sockets))
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/server.py", line 68, in serve
config.load()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/config.py", line 473, in load
self.loaded_app = import_from_string(self.app)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/uvicorn/importer.py", line 21, in import_from_string
module = importlib.import_module(module_str)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
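File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed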
File "/mnt/gpu/code/fastapi_vllm_codellama.py", line 22, in <module>
llm = LLM(model="./CodeLlama-13B-AWQ", quantization="awq")
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 89, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 229, in from_engine_args
engine = cls(*engine_configs,
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
self._init_cache()
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 191, in _init_cache
num_blocks = self._run_workers(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
output = executor(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/worker/worker.py", line 109, in profile_num_available_blocks
self.model(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 297, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
hidden_states = layer(
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 216, in forward
hidden_states = self.mlp(hidden_states)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 81, in forward
gate_up, _ = self.gate_up_proj(x)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 238, in forward
output_parallel = self.apply_weights(input_parallel, bias)
File "/mnt/gpu/code/miniconda/envs/code/lib/python3.10/site-packages/vllm/model_executor/layers/quantized_linear/awq.py", line 55, in apply_weights
out = quantization_ops.awq_gemm(reshaped_x, self.qweight, self.scales,
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
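Because CUDA kernel errors can be reported asynchronously, the frame shown above may not be the exact point of failure. A rerun with synchronous kernel launches might give a more reliable traceback. The following is only a minimal sketch of how that could be set up: the helper names `build_debug_env` and `run_server_with_debug_env` are hypothetical, and the `:app` attribute in the uvicorn target is an assumption, since the log only shows the module name `fastapi_vllm_codellama`.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python
    # traceback should point at the call that actually faulted.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_server_with_debug_env(app_path="fastapi_vllm_codellama:app"):
    """Start uvicorn in a child process that inherits the debug environment.

    The ":app" attribute name is an assumption; adjust it to the real
    FastAPI object defined in the module.
    """
    cmd = [sys.executable, "-m", "uvicorn", app_path]
    return subprocess.run(cmd, env=build_debug_env(), check=False)
```

Calling `run_server_with_debug_env()` from the directory containing `fastapi_vllm_codellama.py` should reproduce the failure with blocking launches. The variable is set for a child process rather than inside the app so that it is in place before CUDA is initialized.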