Discard fields when replica number equals zero to avoid api client error #313
Signed-off-by: byhsu <[email protected]>
Codecov Report
@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   63.02%   64.35%   +1.32%
==========================================
  Files         148      148
  Lines       12149     9857    -2292
==========================================
- Hits         7657     6343    -1314
+ Misses       3912     2934     -978
  Partials      580      580
@ByronHsu Perhaps also include the symptoms of this bug in the PR description, i.e., the chief and ps processes are shown as terminated immediately.
@ByronHsu thanks, this is awesome! Can you maybe link to the TF CRD version changes? So if you specify 0 for one of the replica sets (e.g. chief, worker, etc.), then it defaults to a minimum number of replicas? You can't control the replica configurations (ie.
@hamersaw We used the CRD in
I'm going to assume that, since these are all kf operators, this functionality is the same for mpi and pytorch. Can we make a quick change to update those two plugins the same way this one is updated?
Signed-off-by: Daniel Rammer <[email protected]>
Added more checks to replica counts in kubeflow operators
@hamersaw Thanks for unifying kf components!
…ror (#313)
* Discard field when replica number equals zero to avoid api client error
Signed-off-by: byhsu <[email protected]>
* Improve comments
Signed-off-by: byhsu <[email protected]>
* added more checks to replica counts in kubeflow operators
Signed-off-by: Daniel Rammer <[email protected]>
---------
Signed-off-by: byhsu <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Co-authored-by: byhsu <[email protected]>
Co-authored-by: Daniel Rammer <[email protected]>
TL;DR
Some versions of the TensorFlow (TFJob) CRD enforce a minimum number of replicas. If a user passes 0 as the number of replicas, the k8s client throws an error at runtime. The replica field should therefore be discarded when the number of replicas is zero.
Type
Bug
Older versions of the TFJob CRD validate the minimum number of replicas. If the replica count is 0 but the field is still present, the kubeflow client fails.
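The idea behind the fix can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `ReplicaSpec` and `buildReplicaSpecs` are hypothetical stand-ins for the operator's real types and helpers, and the role names are only examples. Treating negative counts the same as zero is also an assumption.

```go
package main

import "fmt"

// ReplicaSpec is a hypothetical stand-in for an operator's per-role
// replica spec; it is not the real kubeflow API type.
type ReplicaSpec struct {
	Replicas int32
	Image    string
}

// buildReplicaSpecs returns a spec only for roles with a positive replica
// count. Roles with zero replicas are omitted entirely, so a CRD that
// validates a minimum replica count never sees a field with value 0.
// Skipping negative counts as well is an assumption of this sketch.
func buildReplicaSpecs(requested map[string]int32, image string) map[string]*ReplicaSpec {
	specs := make(map[string]*ReplicaSpec)
	for role, n := range requested {
		if n <= 0 {
			continue // discard the field instead of sending replicas: 0
		}
		specs[role] = &ReplicaSpec{Replicas: n, Image: image}
	}
	return specs
}

func main() {
	// Example request: no chief or ps replicas, two workers.
	requested := map[string]int32{"Chief": 0, "PS": 0, "Worker": 2}
	specs := buildReplicaSpecs(requested, "example/image:tag")
	for role, spec := range specs {
		fmt.Printf("%s: %d replicas\n", role, spec.Replicas)
	}
}
```

With this shape, a job that requests zero chief or ps replicas is submitted with only the worker entry present, which is the behavior the description above asks for.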
Test
Are all requirements met?