update serving image runtime #12433
Conversation
I can't wait to try it
Hi, we will build an image based on your code changes and benchmark the performance to determine whether the compute runtime upgrade affects it.
Can I ask which engine you're seeing the best performance with on Intel Arc?
Following your modification, it works.
Do you mean the serving engine? For serving, we currently only provide the vLLM engine.
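As a side note for anyone trying it: once the vLLM engine is up, it serves an OpenAI-compatible HTTP API. A minimal request sketch, assuming a server like the ones in the scripts below running on localhost:8000 with --served-model-name Qwen2.5 (host, port, and model name are assumptions, not from this PR):

# Query the OpenAI-compatible completions endpoint exposed by the api_server.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen2.5", "prompt": "The capital of France is", "max_tokens": 32}'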
We have tested AWQ and GPTQ models. AWQ:
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="/llm/models/Llama-2-7B-Chat-AWQ/",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          quantization="AWQ",
          load_in_low_bit="asym_int4",
          tensor_parallel_size=1)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
GPTQ:
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="/llm/models/Llama-2-7B-Chat-GPTQ/",
          quantization="GPTQ",
          load_in_low_bit="asym_int4",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          tensor_parallel_size=1)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
LGTM
Hi, we have verified that this change results in some performance improvements, so I will proceed to merge this PR. Thank you!
AFAICT you offer at least
So far vLLM seems the fastest, but its excessive and permanent memory usage makes it unsuitable for desktop use. Ollama dynamically loads and unloads models, but seems much slower.
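On the memory point: most of vLLM's permanent footprint is the KV-cache pool it pre-allocates at startup, so it can be shrunk (though not unloaded on idle like Ollama) via the same flags used in the launch scripts below. A minimal sketch, with a hypothetical model path:

#!/bin/bash
# Hedged sketch: lowering --gpu-memory-utilization, --max-model-len, and
# --max-num-seqs reduces the pre-allocated KV cache, trading away throughput
# and context length. The model path is hypothetical.
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model /llm/models/Qwen2.5-7B-Instruct \
  --served-model-name Qwen2.5 \
  --port 8000 \
  --device xpu \
  --load-in-low-bit asym_int4 \
  --gpu-memory-utilization 0.5 \
  --max-model-len 2048 \
  --max-num-seqs 4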
#!/bin/bash
model="/llm/models/Qwen2.5-72B-Instruct-AWQ"
served_model_name="Qwen2.5"
export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype auto \
--enforce-eager \
--quantization awq \
--load-in-low-bit asym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--max-num-seqs 12 \
--tensor-parallel-size 4 \
--disable-async-output-proc \
--distributed-executor-backend ray

7B-Instruct works well on 4 Arc A770. Both 72B-Instruct-AWQ and 7B-Instruct-AWQ crash:

INFO 11-28 19:22:17 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(pid=24957) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=24957) warn(
(pid=24957) 2024-11-28 19:22:19,854 - INFO - intel_extension_for_pytorch auto imported
(pid=24961) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=24961) warn( [repeated 2x across cluster]
(pid=24961) 2024-11-28 19:22:25,489 - INFO - intel_extension_for_pytorch auto imported [repeated 2x across cluster]
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 11-28 19:22:29 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:22:29 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=24956) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=24956) INFO 11-28 19:22:29 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=24956) INFO 11-28 19:22:29 selector.py:138] Using IPEX attention backend.
INFO 11-28 19:22:29 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x748d612befd0>, local_subscribe_port=48499, remote_subscribe_port=None)
INFO 11-28 19:22:29 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:22:29 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/11 [00:00<?, ?it/s]
...
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00, 1.61it/s]
2024-11-28 19:22:36,396 - INFO - Converting the current model to asym_int4 format......
2024-11-28 19:22:36,396 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(WrapperWithLoadBit pid=24956) 2024-11-28 19:22:40,076 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=24956) 2024-11-28 19:22:40,076 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=24962) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=24962) warn(
(pid=24962) 2024-11-28 19:22:28,313 - INFO - intel_extension_for_pytorch auto imported
2024-11-28 19:22:58,678 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-11-28 19:23:01,783 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=24962) 2024-11-28 19:22:40,388 - INFO - Converting the current model to asym_int4 format...... [repeated 2x across cluster]
(WrapperWithLoadBit pid=24956) 2024-11-28 19:24:19,537 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=24956) 2024-11-28 19:24:24,224 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=24962) 2024-11-28 19:24:21,346 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 2x across cluster]
2024:11:28-19:24:27:(26920) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(26924) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(26927) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(26930) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:24:27:(24574) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 20x]
-----> current rank: 0, world size: 4, byte_count: 65536000
*** buffer overflow detected ***: terminated
*** SIGABRT received at time=1732793067 on cpu 2 ***
PC: @ 0x748d6ec4e9fc (unknown) pthread_kill
@ 0x748d6ebfa520 (unknown) (unknown)
[2024-11-28 19:24:27,772 E 24574 26937] logging.cc:440: *** SIGABRT received at time=1732793067 on cpu 2 ***
[2024-11-28 19:24:27,772 E 24574 26937] logging.cc:440: PC: @ 0x748d6ec4e9fc (unknown) pthread_kill
[2024-11-28 19:24:27,772 E 24574 26937] logging.cc:440: @ 0x748d6ebfa520 (unknown) (unknown)
Fatal Python error: Aborted
Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, markupsafe._speedups, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, PIL._imaging, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imagingft, sentencepiece._sentencepiece, uvloop.loop, psutil._psutil_linux, psutil._psutil_posix, msgspec._core, regex._regex, msgpack._cmsgpack, google._upb._message, setproctitle, ray._raylet, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, zmq.backend.cython._zmq, pyarrow.lib, pyarrow._json (total: 48)
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 574, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.11/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 541, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 105, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/entrypoints/openai/api_server.py", line 195, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
root@a5770-PA602-12900K:/llm# /usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
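One quick thing to rule out before digging into the allreduce crash itself: confirm all four Arcs are actually visible inside the container. A sketch using standard oneAPI/IPEX tooling (not specific to this image):

# List SYCL-visible devices; all four A770s should appear as Level Zero GPUs.
sycl-ls
# Cross-check from Python via intel_extension_for_pytorch's XPU backend.
python -c "import torch; import intel_extension_for_pytorch; print(torch.xpu.device_count())"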
Run this shell script:

#!/bin/bash
model="/llm/models/Qwen2.5-72B-Instruct"
served_model_name="Qwen2.5"
export CCL_WORKER_COUNT=4
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
source /opt/intel/1ccl-wks/setvars.sh
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name $served_model_name \
--port 8000 \
--model $model \
--trust-remote-code \
--block-size 8 \
--gpu-memory-utilization 0.9 \
--device xpu \
--dtype auto \
--enforce-eager \
--load-in-low-bit asym_int4 \
--max-model-len 2048 \
--max-num-batched-tokens 4000 \
--max-num-seqs 12 \
--tensor-parallel-size 4 \
--disable-async-output-proc \
--distributed-executor-backend ray

It terminated here:

INFO 11-28 19:22:17 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(The rest of this log is identical to the crash above: IPEX attention backend selected, weights converted to asym_int4, CCL topology warnings, then "*** buffer overflow detected ***" and SIGABRT at rank 0 / world size 4 / byte_count 65536000, ending in RuntimeError: Engine process failed to start.)
^C
root@a5770-PA602-12900K:/llm# nano start-vllm-service.sh
root@a5770-PA602-12900K:/llm# nano start-vllm-service.sh
root@a5770-PA602-12900K:/llm# sh ./start-vllm-service.sh
./start-vllm-service.sh: 15: source: not found
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-11-28 19:31:23,559 - INFO - intel_extension_for_pytorch auto imported
INFO 11-28 19:31:24 api_server.py:529] vLLM API server version 0.6.2+ipexllm
INFO 11-28 19:31:24 api_server.py:530] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, load_in_low_bit='asym_int4', model='/llm/models/Qwen2.5-72B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=2048, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=8, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=4000, max_num_seqs=12, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=True, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='xpu', num_scheduler_steps=1, multi_step_stream_outputs=False, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=True, override_neuron_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 11-28 19:31:24 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/8235cd33-6d65-4f42-90d8-32100470dac3 for IPC Path.
INFO 11-28 19:31:24 api_server.py:180] Started engine process with PID 27089
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2024-11-28 19:31:26,676 - INFO - intel_extension_for_pytorch auto imported
2024-11-28 19:31:28,616 INFO worker.py:1819 -- Started a local Ray instance.
INFO 11-28 19:31:29 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/Qwen2.5-72B-Instruct', speculative_config=None, tokenizer='/llm/models/Qwen2.5-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
WARNING 11-28 19:31:29 xpu_executor.py:91] bfloat16 is not fully supported on XPU, casting to float16.
INFO 11-28 19:31:29 ray_gpu_executor.py:135] use_ray_spmd_worker: False
(pid=27477) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=27477) warn(
(pid=27477) 2024-11-28 19:31:31,388 - INFO - intel_extension_for_pytorch auto imported
(pid=27474) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source? [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(pid=27474) warn( [repeated 2x across cluster]
(pid=27474) 2024-11-28 19:31:36,957 - INFO - intel_extension_for_pytorch auto imported [repeated 2x across cluster]
observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
INFO 11-28 19:31:40 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:31:40 selector.py:138] Using IPEX attention backend.
(WrapperWithLoadBit pid=27470) observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
(WrapperWithLoadBit pid=27470) INFO 11-28 19:31:40 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
(WrapperWithLoadBit pid=27470) INFO 11-28 19:31:40 selector.py:138] Using IPEX attention backend.
INFO 11-28 19:31:40 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7a8816ac0590>, local_subscribe_port=50865, remote_subscribe_port=None)
INFO 11-28 19:31:40 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
INFO 11-28 19:31:40 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/37 [00:00<?, ?it/s]
...
Loading safetensors checkpoint shards: 100% Completed | 37/37 [00:19<00:00, 1.93it/s]
2024-11-28 19:32:03,605 - INFO - Converting the current model to asym_int4 format......
2024-11-28 19:32:03,610 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-11-28 19:32:33,288 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-11-28 19:32:36,291 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=27474) 2024-11-28 19:32:48,903 - INFO - Converting the current model to asym_int4 format......
(WrapperWithLoadBit pid=27474) 2024-11-28 19:32:48,904 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
(pid=27463) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
(pid=27463) warn(
(pid=27463) 2024-11-28 19:31:39,752 - INFO - intel_extension_for_pytorch auto imported
(WrapperWithLoadBit pid=27463) 2024-11-28 19:32:49,131 - INFO - Converting the current model to asym_int4 format...... [repeated 2x across cluster]
(WrapperWithLoadBit pid=27463) 2024-11-28 19:33:29,993 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 3x across cluster]
(WrapperWithLoadBit pid=27463) 2024-11-28 19:33:35,236 - INFO - Loading model weights took 10.1005 GB
(WrapperWithLoadBit pid=27474) 2024-11-28 19:33:30,176 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations [repeated 2x across cluster]
2024:11:28-19:33:37:(29434) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(29438) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(29443) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(29447) |CCL_WARN| no membind support for NUMA node 0, skip thread membind
2024:11:28-19:33:37:(27089) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 20x]
-----> current rank: 0, world size: 4, byte_count: 65536000
*** buffer overflow detected ***: terminated
*** SIGABRT received at time=1732793617 on cpu 4 ***
PC: @ 0x7a88304bd9fc (unknown) pthread_kill
@ 0x7a8830469520 (unknown) (unknown)
[2024-11-28 19:33:37,162 E 27089 29451] logging.cc:440: *** SIGABRT received at time=1732793617 on cpu 4 ***
[2024-11-28 19:33:37,162 E 27089 29451] logging.cc:440: PC: @ 0x7a88304bd9fc (unknown) pthread_kill
[2024-11-28 19:33:37,162 E 27089 29451] logging.cc:440: @ 0x7a8830469520 (unknown) (unknown)
Fatal Python error: Aborted
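Both failing runs abort in the rank-0 allreduce immediately after the CCL topology warnings, so one low-risk experiment, suggested by the warning text itself rather than a confirmed fix, is to disable topology recognition and relaunch with bash (the earlier "source: not found" came from running the script with sh):

# Hedged experiment taken from the CCL warning above: skip topology recognition
# and assume XeLinks across devices. Not a verified fix for the buffer overflow.
export CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
# Run with bash, not sh: the script uses `source`, which dash does not provide.
bash ./start-vllm-service.sh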
Updates the serving Docker image to match the inference cpp image; fixes #12372 (Intel Arc GPUs not being detected on newer host kernels) for this particular image.
Maybe y'all should consider building a base image with the common dependencies, and then building images on top of it that install the specific software?
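For illustration, a hedged sketch of that layering; every image name, tag, and package set here is hypothetical, not taken from the project's actual Dockerfiles:

# Dockerfile.base -- shared runtime layer (hypothetical names throughout)
FROM intel/oneapi-basekit:2024.2.1-devel-ubuntu22.04
RUN pip install --no-cache-dir intel-extension-for-pytorch ipex-llm

# Dockerfile.serving -- built on top, adds only the serving-specific software
FROM ipex-llm-base:latest
RUN pip install --no-cache-dir vllm ray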