🐛 Bug
When using the ROCm 5.7 nightly build to run `serve` or `chat`, the JIT compile step crashes on the first run, after the weights are downloaded and before the MD5-named model lib is written out.
To Reproduce
Steps to reproduce the behavior:
1. Install the latest rocm57 nightly.
2. Clear out any cached model libs, then run `serve` or `chat` on any model known to be supported and working.
3. The flow crashes during JIT compilation (a minimal Python sketch of the same call path follows these steps, and the full console output is below).
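For reference, the equivalent Python call path is sketched here. This is a minimal sketch, assuming the `ChatModule` API visible in the final traceback (`mlc_llm/chat_module.py`); the model id and device are taken from the log, and the actual crash was triggered through the `mlc_llm chat` CLI.

```python
# Minimal repro sketch, assuming the ChatModule API seen in the traceback below.
# Model id and device come from the log; the real crash was hit via `mlc_llm chat`.
from mlc_llm.chat_module import ChatModule

cm = ChatModule(
    model="HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",  # same model as in the log
    device="rocm",                                # auto-detection picks rocm:0 below
)
print(cm.generate(prompt="Hello"))                # never reached; JIT compile fails first
```

The full console output from a clean cache follows.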
[04:10:24] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-15 04:10:26] INFO auto_device.py:85: Not found device: cuda:0
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:0
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:1
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:2
[2024-04-15 04:10:28] INFO auto_device.py:76: Found device: rocm:3
[2024-04-15 04:10:29] INFO auto_device.py:85: Not found device: metal:0
[2024-04-15 04:10:30] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-15 04:10:31] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-15 04:10:31] INFO auto_device.py:33: Using device: rocm:0
[2024-04-15 04:10:31] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gemma-2b-it-q4f16_1-MLC
[2024-04-15 04:10:31] INFO download.py:131: Weights already downloaded: /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC
[2024-04-15 04:10:31] INFO chat_module.py:781: Model lib not found. Now compiling model lib on device...
[2024-04-15 04:10:32] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-15 04:10:32] INFO jit.py:94: Compiling using commands below:
[2024-04-15 04:10:32] INFO jit.py:95: /usr/bin/python3 -m mlc_llm compile /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC --opt 'flashinfer=1;cublas_gemm=1;faster_transformer=1;cudagraph=0;cutlass=1;ipc_allreduce_strategy=NONE' --overrides 'context_window_size=8192;prefill_chunk_size=1024;tensor_parallel_shards=1' --device rocm:0 --output /tmp/tmpzldeenzs/lib.so
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[04:10:32] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-15 04:10:33] INFO auto_config.py:69: Found model configuration: /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC/mlc-chat-config.json
[2024-04-15 04:10:33] INFO auto_target.py:84: Detecting target device: rocm:0
[2024-04-15 04:10:33] INFO auto_target.py:86: Found target: {"thread_warp_size": 64, "mtriple": "amdgcn-amd-amdhsa-hcc", "max_threads_per_block": 1024, "max_num_threads": 256, "kind": "rocm", "max_shared_memory_per_block": 65536, "tag": "", "mcpu": "gfx908", "keys": ["rocm", "gpu"]}
[2024-04-15 04:10:33] INFO auto_target.py:103: Found host LLVM triple: x86_64-unknown-linux-gnu
[2024-04-15 04:10:33] INFO auto_target.py:104: Found host LLVM CPU: skylake-avx512
[2024-04-15 04:10:33] INFO auto_config.py:153: Found model type: gemma. Use `--model-type` to override.
Compiling with arguments:
--config GemmaConfig(hidden_size=2048, hidden_act='gelu', intermediate_size=16384, attention_bias=False, num_attention_heads=8, num_key_value_heads=1, head_dim=256, num_hidden_layers=18, rms_norm_eps=1e-06, vocab_size=256000, position_embedding_base=10000.0, context_window_size=8192, prefill_chunk_size=1024, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type gemma
--target {"thread_warp_size": 64, "host": {"mtriple": "x86_64-unknown-linux-gnu", "tag": "", "kind": "llvm", "mcpu": "skylake-avx512", "keys": ["cpu"]}, "mtriple": "amdgcn-amd-amdhsa-hcc", "max_threads_per_block": 1024, "max_num_threads": 256, "kind": "rocm", "max_shared_memory_per_block": 65536, "tag": "", "mcpu": "gfx908", "keys": ["rocm", "gpu"]}
--opt flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output /tmp/tmpzldeenzs/lib.so
--overrides context_window_size=8192;sliding_window_size=None;prefill_chunk_size=1024;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-04-15 04:10:33] INFO config.py:106: Overriding context_window_size from 8192 to 8192
[2024-04-15 04:10:33] INFO config.py:106: Overriding prefill_chunk_size from 1024 to 1024
[2024-04-15 04:10:33] INFO config.py:106: Overriding tensor_parallel_shards from 1 to 1
[2024-04-15 04:10:33] INFO compile.py:137: Creating model from: GemmaConfig(hidden_size=2048, hidden_act='gelu', intermediate_size=16384, attention_bias=False, num_attention_heads=8, num_key_value_heads=1, head_dim=256, num_hidden_layers=18, rms_norm_eps=1e-06, vocab_size=256000, position_embedding_base=10000.0, context_window_size=8192, prefill_chunk_size=1024, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-04-15 04:10:33] INFO compile.py:156: Exporting the model to TVM Unity compiler
[2024-04-15 04:10:34] INFO compile.py:162: Running optimizations using TVM Unity
[2024-04-15 04:10:34] INFO compile.py:176: Registering metadata: {'model_type': 'gemma', 'quantization': 'q4f16_1', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 1024, 'tensor_parallel_shards': 1, 'kv_cache_bytes': 0}
[2024-04-15 04:10:35] INFO pipeline.py:50: Running TVM Relax graph-level optimizations
[2024-04-15 04:10:45] INFO pipeline.py:50: Lowering to TVM TIR kernels
[2024-04-15 04:10:46] INFO pipeline.py:50: Running TVM TIR-level optimizations
[2024-04-15 04:10:50] INFO pipeline.py:50: Running TVM Dlight low-level optimizations
[2024-04-15 04:10:52] INFO pipeline.py:50: Lowering to VM bytecode
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `alloc_embedding_tensor`: 4.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_decode`: 10.31 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_prefill`: 132.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `batch_verify`: 132.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `decode`: 0.13 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `embed`: 4.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `prefill`: 132.00 MB
[2024-04-15 04:10:53] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-15 04:10:54] INFO pipeline.py:50: Compiling external modules
[2024-04-15 04:10:54] INFO pipeline.py:50: Compilation complete! Exporting to disk
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 52, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 25, in main
cli.main(sys.argv[2:])
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/compile.py", line 128, in main
compile(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/compile.py", line 234, in compile
_compile(args, model_config)
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/compile.py", line 179, in _compile
args.build_func(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/support/auto_target.py", line 266, in build
relax.build(
File "/usr/local/lib/python3.10/dist-packages/tvm/relax/vm_build.py", line 341, in build
return _vmlink(
File "/usr/local/lib/python3.10/dist-packages/tvm/relax/vm_build.py", line 247, in _vmlink
lib = tvm.build(
File "/usr/local/lib/python3.10/dist-packages/tvm/driver/build_module.py", line 297, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
File "tvm/_ffi/_cython/./packed_func.pxi", line 56, in tvm._ffi._cy3.core.tvm_callback
File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/rocm.py", line 120, in callback_rocm_link
rocm_link(tmp_obj, tmp_cobj)
File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/rocm.py", line 85, in rocm_link
lld if lld is not None else find_lld()[0],
File "/usr/local/lib/python3.10/dist-packages/tvm/contrib/rocm.py", line 59, in find_lld
raise RuntimeError("cannot find ld.lld, candidates are: " + str(lld_list))
RuntimeError: cannot find ld.lld, candidates are: ['ld.lld-17.0', 'ld.lld-17', 'ld.lld', '/opt/rocm/llvm/bin']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 772, in __init__
self.model_lib_path = _get_lib_module_path(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 591, in _get_lib_module_path
raise FileNotFoundError(err_msg)
FileNotFoundError: Cannot find the model library that corresponds to `None`.
`None` is either provided in the `chat_config` you passed in, or specified in /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC/mlc-chat-config.json.
We searched over the following possible paths:
- None-rocm.so
- dist/prebuilt/lib/None-rocm.so
- dist/HF://mlc-ai/gemma-2b-it-q4f16_1-MLC/None-rocm.so
- /root/.cache/mlc_llm/model_weights/mlc-ai/gemma-2b-it-q4f16_1-MLC/None-rocm.so
- /root/.cache/mlc_llm/model_weights/mlc-ai/None-rocm.so
If you would like to directly specify the model library path, you may consider passing in the `ChatModule.model_lib_path` parameter.
Please checkout https://github.com/mlc-ai/notebooks/blob/main/mlc-llm/tutorial_chat_module_getting_started.ipynb for an example on how to load a model.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/mlc_llm", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 37, in main
cli.main(sys.argv[2:])
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/chat.py", line 41, in main
chat(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/chat.py", line 133, in chat
cm = ChatModule(model, device, chat_config=config, model_lib_path=model_lib_path)
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 785, in __init__
jit.jit(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/jit.py", line 123, in jit
_run_jit(
File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/jit.py", line 96, in _run_jit
subprocess.run(cmd, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'mlc_llm', 'compile', '/root/.cache/mlc_llm/
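The root cause in the first traceback is `find_lld` in `tvm/contrib/rocm.py` failing to locate `ld.lld`. A quick, standard-library-only check (candidate names copied verbatim from the RuntimeError above) shows whether any of them resolve on `PATH` inside the container; consistent with the RuntimeError, they are all expected to come back empty on this setup.

```python
# Stdlib-only sanity check: do any of the linker names from the RuntimeError
# above resolve on PATH? On the failing container each of these prints None.
import shutil

for name in ["ld.lld-17.0", "ld.lld-17", "ld.lld"]:
    print(name, "->", shutil.which(name))
```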
Expected behavior
A simple invocation of the flow should work, as it does with the NVIDIA CUDA 12.2 nightly build.
Environment
How you installed MLC-LLM (conda, source): nightly rocm57
How you installed TVM-Unity (pip, source): nightly rocm57
Python version (e.g. 3.10): 3.10
GPU driver version (if applicable):
CUDA/cuDNN version (if applicable):
TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
Additional context
Likely caused by mlc-ai/relax#316.