update serving image runtime #12433
Conversation
I can't wait to try it
Hi, we will build an image based on your code changes and benchmark the performance to determine whether the compute runtime upgrade affects it.
Can I ask which engine you're seeing the best performance with on Intel Arc?
Following your modification, it works.
Do you mean the serving engine? For serving, we currently only provide the vLLM engine.
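As a side note for anyone trying it: once the vLLM engine is up, it serves an OpenAI-compatible HTTP API. A minimal request sketch, assuming a server like the ones in the scripts below running on localhost:8000 with --served-model-name Qwen2.5 (host, port, and model name are assumptions, not from this PR):

# Query the OpenAI-compatible completions endpoint exposed by the api_server.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5", "prompt": "The capital of France is", "max_tokens": 32}'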
We have tested AWQ and GPTQ models. AWQ:
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="/llm/models/Llama-2-7B-Chat-AWQ/",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          quantization="AWQ",
          load_in_low_bit="asym_int4",
          tensor_parallel_size=1)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
GPTQ:
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="/llm/models/Llama-2-7B-Chat-GPTQ/",
          quantization="GPTQ",
          load_in_low_bit="asym_int4",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          tensor_parallel_size=1)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
LGTM
Hi, we have verified that this change results in some performance improvements, so I will proceed to merge this PR. Thank you!
AFAICT you offer at least
So far vLLM seems the fastest, but its excessive and permanent memory usage makes it unsuitable for desktop use. Ollama dynamically loads and unloads models, but seems much slower.
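On the memory point: most of vLLM's permanent footprint is the KV-cache pool it pre-allocates at startup, so it can be shrunk (though not unloaded on idle like Ollama) via the same flags used in the launch scripts below. A minimal sketch, with a hypothetical model path:

#!/bin/bash
# Hedged sketch: lowering --gpu-memory-utilization, --max-model-len, and
# --max-num-seqs reduces the pre-allocated KV cache, trading away throughput
# and context length. The model path is hypothetical.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-7B-Instruct \
  --served-model-name Qwen2.5 \
  --port 8000 \
  --device xpu \
  --load-in-low-bit asym_int4 \
  --gpu-memory-utilization 0.5 \
  --max-model-len 2048 \
  --max-num-seqs 4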
#!/bin/bash
model="/llm/models/Qwen2.5-72B-Instruct-AWQ"
served_model_name="Qwen2.5"
export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype auto \
--enforce-eager \
--quantization awq \
--load-in-low-bit asym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--max-num-seqs 12 \
--tensor-parallel-size 4 \
--disable-async-output-proc \
--distributed-executor-backend ray

7B-Instruct works well on 4 Arc A770. Both 72B-Instruct-AWQ and 7B-Instruct-AWQ crash:

INFO 11-28 19:22:17 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(pid=24957) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=24957) warn(
(pid=24957) 2024-11-28 19:22:19,854 - INFO - intel_extension_for_pytorch auto imported
(pid=24961) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=24961) warn( [repeated 2x across cluster]
(pid=24961) 2024-11-28 19:22:25,489 - INFO - intel_extension_for_pytorch auto imported [repeated 2x across cluster]
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 11-28 19:22:29 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:22:29 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=24956) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=24956) INFO 11-28 19:22:29 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=24956) INFO 11-28 19:22:29 selector.py:138] Using IPEX attention backend.
INFO 11-28 19:22:29 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x748d612befd0>, local_subscribe_port=48499, remote_subscribe_port=None)
INFO 11-28 19:22:29 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:22:29 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
...
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00, 1.61it/s]
2024-11-28 19:22:36,396 - INFO - Converting the current model to asym_int4 format......
2024-11-28 19:22:36,396 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=24956) 2024-11-28 19:22:40,076 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=24956) 2024-11-28 19:22:40,076 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=24962) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=24962) warn(
(pid=24962) 2024-11-28 19:22:28,313 - INFO - intel_extension_for_pytorch auto imported
2024-11-28 19:22:58,678 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-11-28 19:23:01,783 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=24962) 2024-11-28 19:22:40,388 - INFO - Converting the current model to asym_int4 format...... [repeated 2x across cluster]
(WrapperWithLoadBit pid=24956) 2024-11-28 19:24:19,537 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=24956) 2024-11-28 19:24:24,224 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=24962) 2024-11-28 19:24:21,346 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 2x across cluster]
2024:11:28-19:24:27:(26920) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(26924) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(26927) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(26930) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(24574) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 20x]
-----> current rank: 0, world size: 4, byte_count: 65536000
*** buffer overflow detected ***: terminated
*** SIGABRT received at time=1732793067 on cpu 2 ***
PC: @ 0x748d6ec4e9fc (unknown) pthread_kill
@ 0x748d6ebfa520 (unknown) (unknown)
[2024-11-28 19:24:27,772 E 24574 26937] logging.cc:440: *** SIGABRT received at time=1732793067 on cpu 2 ***
[2024-11-28 19:24:27,772 E 24574 26937] logging.cc:440: PC: @ 0x748d6ec4e9fc (unknown) pthread_kill
[2024-11-28 19:24:27,772 E 24574 26937] logging.cc:440: @ 0x748d6ebfa520 (unknown) (unknown)
Fatal Python error: Aborted
Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, PIL._imaging, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imagingft, sentencepiece._sentencepiece, uvloop.loop, psutil._psutil_linux, psutil._psutil_posix, msgspec._core, regex._regex, msgpack._cmsgpack, google._upb._message, setproctitle, ray._raylet, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, zmq.backend.cython._zmq, pyarrow.lib, pyarrow._json (total: 48)
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 574, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 541, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 195, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
root@a5770-PA602-12900K:/llm# /usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
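One quick thing to rule out before digging into the allreduce crash itself: confirm all four Arcs are actually visible inside the container. A sketch using standard oneAPI/IPEX tooling (not specific to this image):

# List SYCL-visible devices; all four A770s should appear as Level Zero GPUs.
sycl-ls
# Cross-check from Python via intel_extension_for_pytorch's XPU backend.
python -c "import torch; import intel_extension_for_pytorch; print(torch.xpu.device_count())"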
Run this shell script:

#!/bin/bash
model="/llm/models/Qwen2.5-72B-Instruct"
served_model_name="Qwen2.5"
export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype auto \
--enforce-eager \
--load-in-low-bit asym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--max-num-seqs 12 \
--tensor-parallel-size 4 \
--disable-async-output-proc \
--distributed-executor-backend ray

It terminated here:

INFO 11-28 19:22:17 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(The rest of this log is identical to the crash above: IPEX attention backend selected, weights converted to asym_int4, CCL topology warnings, then "*** buffer overflow detected ***" and SIGABRT at rank 0 / world size 4 / byte_count 65536000, ending in RuntimeError: Engine process failed to start.)
^C
root@a5770-PA602-12900K:/llm# nano start-vllm-service.sh
root@a5770-PA602-12900K:/llm# nano start-vllm-service.sh
root@a5770-PA602-12900K:/llm# sh ./start-vllm-service.sh
./start-vllm-service.sh: 15: source: not found
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-11-28 19:31:23,559 - INFO - intel_extension_for_pytorch auto imported
INFO 11-28 19:31:24 api_server.py:529] vLLM API server version 0.6.2+ipexllm
INFO 11-28 19:31:24 api_server.py:530] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, load_in_low_bit='asym_int4', model='/llm/models/Qwen2.5-72B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=4000, max_num_seqs=12, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-28 19:31:24 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/8235cd33-6d65-4f42-90d8-32100470dac3 for IPC Path.
INFO 11-28 19:31:24 api_server.py:180] Started engine process with PID 27089
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-11-28 19:31:26,676 - INFO - intel_extension_for_pytorch auto imported
2024-11-28 19:31:28,616 INFO worker.py:1819 -- Started a local Ray instance.
INFO 11-28 19:31:29 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/Qwen2.5-72B-Instruct', speculative_config=None, tokenizer='/llm/models/Qwen2.5-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-28 19:31:29 xpu_executor.py:91] bfloat16 is not fully supported on XPU, casting to float16.
INFO 11-28 19:31:29 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(pid=27477) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=27477) warn(
(pid=27477) 2024-11-28 19:31:31,388 - INFO - intel_extension_for_pytorch auto imported
(pid=27474) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=27474) warn( [repeated 2x across cluster]
(pid=27474) 2024-11-28 19:31:36,957 - INFO - intel_extension_for_pytorch auto imported [repeated 2x across cluster]
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 11-28 19:31:40 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:31:40 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=27470) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=27470) INFO 11-28 19:31:40 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=27470) INFO 11-28 19:31:40 selector.py:138] Using IPEX attention backend.
INFO 11-28 19:31:40 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7a8816ac0590>, local_subscribe_port=50865, remote_subscribe_port=None)
INFO 11-28 19:31:40 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:31:40 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/37 [00:00<?, ?it/s]
...
Loading safetensors checkpoint shards: 100% Completed | 37/37 [00:19<00:00, 1.93it/s]
2024-11-28 19:32:03,605 - INFO - Converting the current model to asym_int4 format......
2024-11-28 19:32:03,610 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-11-28 19:32:33,288 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-11-28 19:32:36,291 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=27474) 2024-11-28 19:32:48,903 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=27474) 2024-11-28 19:32:48,904 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=27463) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=27463) warn(
(pid=27463) 2024-11-28 19:31:39,752 - INFO - intel_extension_for_pytorch auto imported
(WrapperWithLoadBit pid=27463) 2024-11-28 19:32:49,131 - INFO - Converting the current model to asym_int4 format...... [repeated 2x across cluster]
(WrapperWithLoadBit pid=27463) 2024-11-28 19:33:29,993 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=27463) 2024-11-28 19:33:35,236 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=27474) 2024-11-28 19:33:30,176 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 2x across cluster]
2024:11:28-19:33:37:(29434) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(29438) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(29443) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(29447) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(27089) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 20x]
-----> current rank: 0, world size: 4, byte_count: 65536000
*** buffer overflow detected ***: terminated
*** SIGABRT received at time=1732793617 on cpu 4 ***
PC: @ 0x7a88304bd9fc (unknown) pthread_kill
@ 0x7a8830469520 (unknown) (unknown)
[2024-11-28 19:33:37,162 E 27089 29451] logging.cc:440: *** SIGABRT received at time=1732793617 on cpu 4 ***
[2024-11-28 19:33:37,162 E 27089 29451] logging.cc:440: PC: @ 0x7a88304bd9fc (unknown) pthread_kill
[2024-11-28 19:33:37,162 E 27089 29451] logging.cc:440: @ 0x7a8830469520 (unknown) (unknown)
Fatal Python error: Aborted
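Both failing runs abort in the rank-0 allreduce immediately after the CCL topology warnings, so one low-risk experiment, suggested by the warning text itself rather than a confirmed fix, is to disable topology recognition and relaunch with bash (the earlier "source: not found" came from running the script with sh):

# Hedged experiment taken from the CCL warning above: skip topology recognition
# and assume XeLinks across devices. Not a verified fix for the buffer overflow.
export CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
# Run with bash, not sh: the script uses `source`, which dash does not provide.
bash ./start-vllm-service.sh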
Updates the serving Docker image to match the inference cpp image; fixes #12372 (Intel Arc GPUs not being detected on newer host kernels) for this particular image.
Maybe y'all should consider building a base image with the common dependencies, and then building images on top of it that install the specific software?
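For illustration, a hedged sketch of that layering; every image name, tag, and package set here is hypothetical, not taken from the project's actual Dockerfiles:

# Dockerfile.base -- shared runtime layer (hypothetical names throughout)
FROM intel/oneapi-basekit:2024.2.1-devel-ubuntu22.04
RUN pip install --no-cache-dir intel-extension-for-pytorch ipex-llm

# Dockerfile.serving -- built on top, adds only the serving-specific software
FROM ipex-llm-base:latest
RUN pip install --no-cache-dir vllm ray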