[ROCm] Fix some kernels failed unit tests #2498
Conversation
cc @tjtanaa
tests/kernels/test_pos_encoding.py (outdated)
@@ -26,6 +27,7 @@
 @pytest.mark.parametrize("seed", SEEDS)
 @pytest.mark.parametrize("device", DEVICES)
 @torch.inference_mode()
+@skipIfRocm
Continuing the discussion from
#2409 (comment)
Thanks for pointing out which tests failed in this case. Could we handle this in a different PR?
I think resolving it in another PR would be better if we are still analysing the issue.
Based on my understanding, fp16 / half is supported as a native operator in ROCm, whereas bfloat16 operations are mostly handled by casting to float as an intermediate type, as shown in /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h:
https://github.com/ROCm/clr/blob/rocm-6.0.x/hipamd/include/hip/amd_detail/amd_hip_bf16.h
https://github.com/ROCm/clr/blob/rocm-6.0.x/hipamd/include/hip/amd_detail/amd_hip_bfloat16.h
https://github.com/ROCm/clr/tree/rocm-5.7.x/hipamd/include/hip/amd_detail/amd_hip_bf16.h
https://github.com/ROCm/clr/tree/rocm-5.7.x/hipamd/include/hip/amd_detail/amd_hip_bfloat16.h
Do you by any chance have any clue as to why the natively supported float16 behaves this way?
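The cast-to-float pattern described above can be sketched in pure Python by emulating bfloat16 as the top 16 bits of a float32. This is a simplified illustration, not the ROCm header code: truncation is used for the narrowing step, whereas the real conversions may round to nearest even.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Narrow an IEEE-754 float32 to bfloat16 by keeping the top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low mantissa."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

def bf16_add(a: float, b: float) -> float:
    """Emulate the widen-compute-narrow pattern: cast both operands to
    float, add, then cast the result back down to bfloat16 precision."""
    fa = bf16_bits_to_f32(f32_to_bf16_bits(a))
    fb = bf16_bits_to_f32(f32_to_bf16_bits(b))
    return bf16_bits_to_f32(f32_to_bf16_bits(fa + fb))

# bfloat16 keeps only ~8 significant bits, so results are coarse:
print(bf16_add(1.0, 1.0))   # 2.0 (exactly representable)
print(bf16_add(0.2, 0.1))   # 0.298828125, not ~0.3
```

Because each intermediate is rounded back to an 8-bit mantissa, bfloat16 results can drift noticeably from float32 references, which is relevant to the tolerance discussion below.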
AMD HW does not have native bfloat16 arithmetic except for the matrix multiply operations.
LGTM.
Hmmm, it seems some of the tests are failing due to allclose: https://buildkite.com/vllm/ci/builds/495#018d323c-6fa7-47ed-a078-04a2bc14f21c/51-7825 We are soft-failing the kernel tests due to GPU memory, but they shouldn't have any error other than CUDAOutOfMemory.
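For context, the allclose failures are elementwise tolerance checks: torch.allclose passes when |a − b| ≤ atol + rtol·|b| holds for every element. A minimal pure-Python sketch of that criterion (the tolerance values below are illustrative, not the ones vLLM's kernel tests actually use):

```python
def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Elementwise closeness check mirroring the torch.allclose criterion:
    |x - y| <= atol + rtol * |y| for every pair of elements."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

# A bfloat16-sized error fails the default (float32-ish) tolerances:
ref = [0.3]
bf16_result = [0.298828125]  # 0.2 + 0.1 at bfloat16 precision
print(allclose(ref, bf16_result))             # False
print(allclose(ref, bf16_result, rtol=1e-2))  # True with a looser rtol
```

This is why a genuine accuracy regression and an overly tight tolerance for a low-precision dtype can look identical in CI output.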
Will check the failed ones. Maybe the "All checks have passed" message below could be "improved"?
Yeah, I'm trying to figure that out, sorry.
@simon-mo Can this pull request be merged now? The longer it stays open, the more potential conflicts will occur.
LGTM. Will merge once the only test failures are OOM, not accuracy being off.
This pull request fixes some of the failed unit tests on the ROCm platform, specifically test_activation.py, test_attention.py, test_pos_encoding.py, and test_cache.py in tests/kernels.
Tests:
To run the associated tests on the ROCm platform, first enter the Docker container with ROCm PyTorch and vLLM installed, copy the tests directory to the vLLM install location, and then run the tests with pytest.
Use the environment variable PYTORCH_TEST_WITH_ROCM=1 if inside the ROCm 5.7 Docker image, as shown in the example below. This environment variable is already set inside the ROCm 6.0 Docker image.
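A sketch of the workflow described above. The container image name and paths are assumptions for illustration, not taken from this PR; adjust them to your setup.

```shell
# Start an interactive ROCm PyTorch container (image tag is an assumption):
#   docker run -it --device=/dev/kfd --device=/dev/dri rocm/pytorch:latest bash
#
# Inside the container, copy the vllm tests directory next to the installed
# vllm package (destination path is an assumption), then run a kernel test.
# The PYTORCH_TEST_WITH_ROCM=1 prefix is only needed in the ROCm 5.7 image;
# the ROCm 6.0 image already sets it.
PYTORCH_TEST_WITH_ROCM=1 pytest tests/kernels/test_pos_encoding.py -v
```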