TFjob pods hang without explanation #1156
@jazzsir Try running the `top` command and check whether your code is actually using the CPU. My issue looked similar to yours, and I was able to solve it by building a new Docker image that inherits from the tensorflow/tensorflow:1.15.2-gpu container and adds my requirements and Python source code on top of it.
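A quick way to confirm whether TensorFlow inside the container actually sees a GPU (a minimal sketch; on the 1.15.x images mentioned above, `list_physical_devices` lives under `tf.config.experimental`):

```python
import tensorflow as tf

# Print the TensorFlow build and the GPUs it can see; an empty list here
# means the job is silently running on the CPU.
print("TF version:", tf.__version__)
print("Visible GPUs:", tf.config.experimental.list_physical_devices("GPU"))
```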
I am not sure this is a TFJob problem. It looks more like an issue with the NVIDIA gpu-device-plugin or the driver.
I figured it out: the two workers I ran were scheduled onto the same node, and the GPUs on that node needed to be shareable between them.
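If the underlying problem was the two worker processes contending for the same GPUs, one common adjustment (a sketch, assuming memory pre-allocation was what kept the GPUs from being shared) is to enable memory growth so each worker only takes the memory it needs:

```python
import tensorflow as tf

# By default each TensorFlow process grabs nearly all memory on every visible
# GPU, so a second worker scheduled onto the same node can stall. Enabling
# memory growth lets the two processes share the device.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```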
I applied a TFJob that runs the MultiWorkerMirroredStrategy example code (https://www.tensorflow.org/tutorials/distribute/keras), but all the pods hang after printing the logs below. GPU memory is occupied, yet GPU utilization ("GPU-Util" in the nvidia-smi output) stays at 0% on every GPU. I also checked the GPU state from a Jupyter notebook and couldn't find any problem.
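For reference, each worker runs a script that follows the linked tutorial, roughly like this (a minimal sketch of the tutorial's MNIST example, not the exact code in my job):

```python
import os
import tensorflow as tf

# TF_CONFIG is injected into each worker pod by the TFJob operator; the
# strategy reads it to discover the other workers.
print("TF_CONFIG:", os.environ.get("TF_CONFIG"))

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

def make_dataset(batch_size=64):
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x[..., None].astype("float32") / 255.0
    return (tf.data.Dataset.from_tensor_slices((x, y))
            .shuffle(10000).batch(batch_size).repeat())

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(),
        metrics=["accuracy"],
    )

model.fit(make_dataset(), epochs=3, steps_per_epoch=70)
```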