Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Discard fields when replica number equals zero to avoid api client error #313

Merged
merged 4 commits into from
Feb 1, 2023

Conversation

ByronHsu
Copy link
Contributor

@ByronHsu ByronHsu commented Jan 29, 2023

TL;DR

In some versions of tensorflow CRD, there is an enforcement to restrict the minimum number of replica. If users pass in 0 as the number of replicas, the k8s client will throw an error at runtime. The field should be discarded if the number of replicas is zero.

Type

  • Bug Fix
  • Feature
  • Plugin

Bug

In the order version of tfjob CRD, it will validate the minimum number of replicas. If the replica == 0 but the field exists, the kubeflow client will fail.

Test

byhsu@byhsu-ld1 ~/local-repo/flyte/flyteplugins gh-tf-replica-bug* ❯ go test github.com/flyteorg/flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow
ok      github.com/flyteorg/flyteplugins/go/tasks/plugins/k8s/kfoperators/tensorflow    0.072s

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

@ByronHsu ByronHsu changed the title Discard field when replica number equals zero to avoid api client error Discard fields when replica number equals zero to avoid api client error Jan 29, 2023
Signed-off-by: byhsu <[email protected]>
@codecov
Copy link

codecov bot commented Jan 30, 2023

Codecov Report

Merging #313 (88e2b40) into master (4634a81) will increase coverage by 1.32%.
The diff coverage is 100.00%.

❗ Current head 88e2b40 differs from pull request most recent head 6b7b974. Consider uploading reports for the commit 6b7b974 to get more accurate results

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   63.02%   64.35%   +1.32%     
==========================================
  Files         148      148              
  Lines       12149     9857    -2292     
==========================================
- Hits         7657     6343    -1314     
+ Misses       3912     2934     -978     
  Partials      580      580              
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 71.00% <100.00%> (+0.03%) ⬆️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 68.60% <100.00%> (+0.64%) ⬆️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 71.27% <100.00%> (-0.28%) ⬇️
go/tasks/plugins/array/inputs.go 83.33% <0.00%> (-4.91%) ⬇️
.../pluginmachinery/internal/webapi/plugin_context.go 50.00% <0.00%> (-3.85%) ⬇️
go/tasks/plugins/k8s/sagemaker/plugin.go 74.07% <0.00%> (-3.20%) ⬇️
go/tasks/plugins/array/core/state.go 69.13% <0.00%> (-2.85%) ⬇️
go/tasks/pluginmachinery/flytek8s/pod_helper.go 81.43% <0.00%> (-2.64%) ⬇️
go/tasks/plugins/hive/config/config.go 33.33% <0.00%> (-2.39%) ⬇️
go/tasks/plugins/presto/config/config.go 33.33% <0.00%> (-2.39%) ⬇️
... and 125 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@goyalankit
Copy link
Contributor

@ByronHsu Perhaps also include the symptoms of this bug in the PR. i.e., chief and ps processes are shown immediately terminated

@hamersaw
Copy link
Contributor

@ByronHsu thanks, this is awesome! Can you maybe link to the TF CRD version changes? So if you specify 0 for one of the replica sets (ex. chief, worker, etc) then it defaults to a minimum number of replicas? You can't control the replica configurations (ie. Template, RestartPolicy) then right?

@hamersaw hamersaw self-requested a review January 30, 2023 21:35
@ByronHsu
Copy link
Contributor Author

ByronHsu commented Jan 30, 2023

@hamersaw We used the CRD in tf-operator, but It might because tf-operator repository is depreacted, I cannot find it online. The CRD we used is this.

So if you specify 0 for one of the replica sets (ex. chief, worker, etc) then it defaults to a minimum number of replicas?
It will not default to the min. It will fail instead.

hamersaw
hamersaw previously approved these changes Jan 31, 2023
@hamersaw
Copy link
Contributor

I'm going to assume that since these are all kf operators this functionality is the same for mpi and pytorch. Can we make the quick change to update those two plugins the same way this is updated?

@ByronHsu
Copy link
Contributor Author

ByronHsu commented Jan 31, 2023

It seems that mpi already handled, and pytorch always has one master. So they should be good.

hamersaw and others added 2 commits February 1, 2023 06:18
Added more checks to replica counts in kubeflow operators
@ByronHsu
Copy link
Contributor Author

ByronHsu commented Feb 1, 2023

@hamersaw Thanks for unifying kf components!

@hamersaw hamersaw merged commit 471a262 into flyteorg:master Feb 1, 2023
eapolinario pushed a commit that referenced this pull request Sep 6, 2023
…ror (#313)

* Discard field when replica number equals zero to avoid api client error

Signed-off-by: byhsu <[email protected]>

* Improve comments

Signed-off-by: byhsu <[email protected]>

* added more checks to replica counts in kubeflow operators

Signed-off-by: Daniel Rammer <[email protected]>

---------

Signed-off-by: byhsu <[email protected]>
Signed-off-by: Daniel Rammer <[email protected]>
Co-authored-by: byhsu <[email protected]>
Co-authored-by: Daniel Rammer <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants