enable inference mode for deepspeed tp serving (#11742)
liu-shaojun authored Aug 8, 2024
1 parent 9e65cf0 commit 107f7aa
Showing 1 changed file with 3 additions and 1 deletion.
python/llm/example/GPU/Deepspeed-AutoTP-FastAPI/serving.py (4 changes: 3 additions & 1 deletion)
@@ -116,11 +116,13 @@ def load_model(model_path, low_bit):
# Use IPEX-LLM `optimize_model` to convert the model into optimized low bit format
# Convert the rest of the model into float16 to reduce allreduce traffic
model = optimize_model(model.module.to(f"cpu"), low_bit=low_bit).to(torch.float16)

# Next, use XPU as accelerator to speed up inference
current_accel = XPU_Accelerator()
set_accelerator(current_accel)

model=model.eval()

# Move model back to xpu
model = model.to(f"xpu:{local_rank}")
model = BenchmarkWrapper(model)
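For context, a minimal sketch of the load path around the new model.eval() call is shown below. It assumes a model already wrapped by DeepSpeed AutoTP (hence model.module) and the optimize_model helper from ipex_llm; the XPU accelerator setup and BenchmarkWrapper steps from the diff are omitted, and load_model_sketch plus the "sym_int4" default are hypothetical names for illustration, not the repository's own.

import torch
from ipex_llm import optimize_model

def load_model_sketch(model, local_rank, low_bit="sym_int4"):
    # Convert the DeepSpeed-wrapped module into IPEX-LLM low-bit format on CPU,
    # keeping the remaining weights in float16 to reduce allreduce traffic.
    model = optimize_model(model.module.to("cpu"), low_bit=low_bit).to(torch.float16)

    # The line this commit adds: switch to evaluation mode before serving so
    # dropout is disabled and normalization layers use their inference behavior.
    model = model.eval()

    # Move the optimized model onto the XPU device owned by this rank.
    model = model.to(f"xpu:{local_rank}")
    return model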
