ROCm-3.7+ broken on gfx803 #1172

Closed
xuhuisheng opened this issue Oct 23, 2020 · 5 comments
@xuhuisheng

xuhuisheng commented Oct 23, 2020

What is the expected behavior

  • Don't crash, and return the correct loss on gfx803

What actually happens

  • Invalid argument: indices[5,284] = 997212422 is not in [0, 5001) (text classification)
  • Low accuracy with a NaN loss (MNIST)
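
To help narrow down where the out-of-range index comes from, the input batch can be checked on the host before it reaches the GPU. The following is a minimal sketch, assuming the token IDs are available as nested Python lists; `find_bad_indices` and `loss_is_finite` are hypothetical helper names, not part of TensorFlow, and the sample batch is made up for illustration.

```python
import math


def find_bad_indices(batch, vocab_size):
    """Return (row, col, value) for every index outside [0, vocab_size)."""
    bad = []
    for r, row in enumerate(batch):
        for c, value in enumerate(row):
            if not 0 <= value < vocab_size:
                bad.append((r, c, value))
    return bad


def loss_is_finite(loss):
    """True when the loss is neither NaN nor infinite."""
    return math.isfinite(loss)


# Hypothetical batch: one entry mimics the value from the error message.
batch = [[1, 42, 5000], [7, 997212422, 3]]
print(find_bad_indices(batch, vocab_size=5001))  # [(1, 1, 997212422)]
print(loss_is_finite(float("nan")))              # False
```

If the host-side batch is clean and the error still appears, the bad value was most likely produced later in the computation, which would be consistent with the rocBLAS workarounds described in this issue.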

How to reproduce

  • On gfx803 with ROCm-3.7+, run the TensorFlow text classification sample. The official TensorFlow sample reproduces the issue about 90% of the time: https://www.tensorflow.org/tutorials/keras/text_classification
  • Many people are hitting this error; see ROCm-3.7+ broken on gfx803 ROCm#1265
  • Workaround 1: I rebuilt rocBLAS with BUILD_WITH_TENSILE_HOST=false and the problem disappeared. Maybe the gfx803 r9nano_*.yaml files are out of date? This workaround causes a compile failure on ROCm-3.9.
  • Workaround 2: keep BUILD_WITH_TENSILE_HOST=true and delete library/src/blas3/Tensile/Logic/asm_full/r9nano_Cijk_Ailk_Bljk_SB.yaml; the issue is then resolved. If I keep even a single solution from this file, the issue reproduces.
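
The file removal in workaround 2 can be scripted so it is repeatable across rocBLAS checkouts. This is a minimal sketch, assuming a local rocBLAS source tree; `remove_logic_file` and its `dry_run` parameter are hypothetical names introduced here, and only the relative path comes from the workaround itself.

```python
from pathlib import Path

# Path taken from workaround 2, relative to the rocBLAS source root.
LOGIC_FILE = Path(
    "library/src/blas3/Tensile/Logic/asm_full/r9nano_Cijk_Ailk_Bljk_SB.yaml"
)


def remove_logic_file(rocblas_root, dry_run=True):
    """Locate the gfx803 logic file and delete it unless dry_run is set.

    Returns the full path when the file exists, otherwise None.
    """
    target = Path(rocblas_root) / LOGIC_FILE
    if not target.is_file():
        return None
    if not dry_run:
        target.unlink()
    return target
```

Calling `remove_logic_file("/path/to/rocBLAS")` first reports what would be deleted; passing `dry_run=False` performs the deletion. rocBLAS still has to be rebuilt afterwards; a later comment in this thread used `sudo bash install.sh -id -a gfx803` for that step.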

Environment

Hardware:
  GPU: gfx803 - RX580 8G (Polaris10), chip ID 0x67df
  CPU: Xeon 2620v3

Software versions:
  ROCK: v3.7, v3.8, v3.9, v3.10, v4.0
  ROCR: v3.7, v3.8, v3.9, v3.10, v4.0
  HCC: v3.7, v3.8, v3.9, v3.10, v4.0
  Library: v3.7, v3.8, v3.9, v3.10, v4.0

@xuhuisheng
Author

Added more environment information and workaround 2.

@aruno14

aruno14 commented Nov 22, 2020

I have the same issue on Fiji [Radeon R9 FURY / NANO Series].
I tried rebuilding rocBLAS after deleting `library/src/blas3/Tensile/Logic/asm_full/r9nano_Cijk_Ailk_Bljk_SB.yaml`.

I used: sudo bash install.sh -id -a gfx803
However, I get the error below when I use TensorFlow:

/src/external/hip-on-vdi/rocclr/hip_fatbin.cpp:39: guarantee(false && "Cannot unmap file")

@lolzballs

I have the same problem as OP on the same chip (gfx803, RX580 8GB). I noticed that ROCm 3.10 was released a few days ago and I'm wondering whether it fixed anything. Unfortunately, my system doesn't have the resources to compile PyTorch or TensorFlow for the updated ROCm version, and the Docker images haven't been updated yet.

@aruno14

aruno14 commented Dec 20, 2020

@lolzballs
I checked with ROCm 4.0 and the latest rocBLAS version, but I got the same error.
/src/external/hip-on-vdi/rocclr/hip_fatbin.cpp:39: guarantee(false && "Cannot unmap file")

@kotatsuyaki

I'm experiencing the same problem on the ROCm 4.0 Docker image.
