Add fake HPU mode to Habana components #180
Conversation
Overall the idea is great, but it introduces lots of conditionals into our code (`if not is_fake_hpu()`).
I think it would be great if we could apply monkey patching here, similar to the GPU Migration Toolkit: https://docs.habana.ai/en/latest/PyTorch/PyTorch_Model_Porting/GPU_Migration_Toolkit/GPU_Migration_Toolkit.html
In that case we could override all "hpu" modules with "pass" (do nothing) or "cpu" equivalents. That would limit the changes to our main HPU-specific modules and ease future development, since there would be no need to add an `is_fake_hpu()` check every time. A rough sketch of the idea follows.
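A minimal sketch of that monkey-patching approach, assuming a `VLLM_USE_FAKE_HPU` environment gate and a `patch_fake_hpu()` entry point (both names are hypothetical, not taken from this PR):

```python
import os
import types

import torch


def is_fake_hpu() -> bool:
    # Assumed env-var gate; the PR's actual trigger may differ.
    return os.environ.get("VLLM_USE_FAKE_HPU", "0") != "0"


def patch_fake_hpu() -> None:
    """Reroute 'hpu' APIs to CPU no-ops in one place, GPU-Migration style,
    so call sites don't need scattered is_fake_hpu() checks."""
    if not is_fake_hpu():
        return
    # Replace the hpu backend namespace with harmless CPU stand-ins.
    torch.hpu = types.SimpleNamespace(
        is_available=lambda: True,   # pretend one device exists
        device_count=lambda: 1,
        current_device=lambda: 0,
        synchronize=lambda: None,    # no-op on CPU
    )
```

Calling `patch_fake_hpu()` once at startup would keep the fake mode contained in a single module instead of spreading conditionals across the codebase.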
```yaml
jobs:
  cputest:
    runs-on: ubuntu-latest
```
Wouldn't it be safer to use a hardcoded Ubuntu version?
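A pinned runner might look like this (`ubuntu-22.04` is just an illustrative choice):

```yaml
jobs:
  cputest:
    runs-on: ubuntu-22.04  # pinned so runner image upgrades don't silently change the test environment
```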
```yaml
      - habana_main
  pull_request:
    branches:
      - habana_main
```
What do you think about also adding habana_next? Just temporarily, for as long as we maintain the two branches; see the sketch below.
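That would just mean extending the trigger lists, e.g. (sketch only):

```yaml
  pull_request:
    branches:
      - habana_main
      - habana_next  # temporary, while both branches are maintained
```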
```yaml
          VLLM_TARGET_DEVICE=hpu python setup.py develop
      - name: cpu-test
        run: |
          VLLM_SKIP_WARMUP=true VLLM_PROMPT_SEQ_BUCKET_MAX=128 python examples/offline_inference_fakehpu.py
```
Running with warmup would be an additional bonus validation, don't you think? Instead of disabling warmup, it would probably be better to limit the number of buckets so that it does not take that much time; see the sketch below.
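One possible shape for that step; note that `VLLM_PROMPT_SEQ_BUCKET_MIN` is an assumed knob here, only `VLLM_PROMPT_SEQ_BUCKET_MAX` appears in the diff above:

```yaml
      - name: cpu-test
        run: |
          # keep warmup enabled, but collapse the prompt bucket range to one size
          VLLM_PROMPT_SEQ_BUCKET_MIN=128 VLLM_PROMPT_SEQ_BUCKET_MAX=128 \
            python examples/offline_inference_fakehpu.py
```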
```diff
@@ -100,6 +100,7 @@ def forward(
         kv_cache: torch.Tensor,
         attn_metadata: AttentionMetadata,
     ) -> torch.Tensor:
+        # import pdb; pdb.set_trace()
```
I guess this comment is not needed
```diff
@@ -126,6 +131,11 @@ def determine_num_available_blocks(self) -> Tuple[int, int]:

         # Execute a forward pass with dummy inputs to profile the memory usage
         # of the model.
+        if is_fake_hpu():
+            # self.model_runner.profile_run()
```
Please remove the commented-out code.
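Once the dead line is dropped, the fake-HPU branch might reduce to something like this (the fixed block counts are placeholders, not values from this PR):

```python
if is_fake_hpu():
    # There is no real device memory to profile on CPU, so skip
    # profile_run() and return fixed, conservative block counts.
    return 128, 0
```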
irrelevant
No description provided.