
[backend] caching fails when task pods have WorkflowTaskResults RBAC (argo 3.4 change) #8942

Closed
thesuperzapper opened this issue Mar 8, 2023 · 21 comments
Labels
area/backend kind/bug lifecycle/stale The issue / pull request is stale, any activities remove this label.

Comments

@thesuperzapper
Member

Issue

When the ServiceAccount used by argo-workflows task Pods has the RBAC to create WorkflowTaskResults resources, argo-workflows changes how it stores task outputs: it writes them to these CRDs instead of patching the Pod to store them in the workflows.argoproj.io/outputs annotation (the previous behavior).
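For reference, a minimal sketch of the kind of RBAC rule that triggers this behavior (the ClusterRole name here is purely illustrative; any Role or ClusterRole bound to the workflow's ServiceAccount with a rule like this has the same effect):

```yaml
# Illustrative only: binding a rule like this to the workflow's
# ServiceAccount causes Argo to report task outputs via
# WorkflowTaskResults instead of the Pod annotation.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-workflowtaskresults-creator  # hypothetical name
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["workflowtaskresults"]
    verbs: ["create", "patch"]
```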

Kubeflow Pipelines does not know about the WorkflowTaskResults CRD, so it fails in a strange way: workflows run successfully the FIRST time they are run (when there is no cache), but on every following run, all tasks fail with the message This step is in Error state with this message: unexpected end of JSON input.

Here is an issue from someone with this problem: #8842

I expect we will see MANY more reports of this issue soon, as argo-workflows 3.4+ uses WorkflowTaskResults by default, and people will try to upgrade from our packaged 3.3 version.

Screenshot:

[screenshot attachment: tasks failing with the error described above]


Impacted by this bug? Give it a 👍.

@thesuperzapper
Member Author

thesuperzapper commented May 16, 2023

@connor-mccarthy @chensun just wondering where we are on this issue?

I am seeing more users have this issue, even on Argo 3.3 (like in our Kubeflow 1.7 manifests).

Note: it happens when the workflow.spec.serviceAccountName used by the Argo Workflow has RBAC to create WorkflowTaskResults.

Because workflows created by Kubeflow Pipelines use ServiceAccount/default-editor (which is bound to ClusterRole/kubeflow-edit), if kubeflow-edit is given permission (through role aggregation) to create WorkflowTaskResults, everything breaks and KFP stops working in the way described above.

We urgently need to make Kubeflow Pipelines aware of WorkflowTaskResults, because when a workflow exists with an empty workflows.argoproj.io/outputs annotation, Kubeflow Pipelines cannot track that pipeline, and everything breaks.
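To see why the failure surfaces as exactly "unexpected end of JSON input", here is a minimal Go sketch (not KFP's actual code; parseOutputs is a hypothetical helper) of what happens when a consumer of the outputs annotation unmarshals an empty value — Go's encoding/json returns precisely that error string for empty input:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseOutputs mimics, in simplified form, what a consumer of the
// workflows.argoproj.io/outputs Pod annotation does: unmarshal the
// annotation value into a structure.
func parseOutputs(annotation string) error {
	var outputs map[string]interface{}
	return json.Unmarshal([]byte(annotation), &outputs)
}

func main() {
	// Pre-WorkflowTaskResults behavior: Argo patches the Pod with a
	// populated annotation, so unmarshalling succeeds.
	fmt.Println(parseOutputs(`{"parameters":[{"name":"out","value":"1"}]}`))

	// With the WorkflowTaskResults RBAC present, the annotation is left
	// empty, and encoding/json fails with the exact error users see:
	// "unexpected end of JSON input".
	fmt.Println(parseOutputs(""))
}
```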

@thesuperzapper
Member Author

@chensun even if we don't upgrade our Argo to 3.4, we really need to make Kubeflow Pipelines aware of WorkflowTaskResults, because as described in #8942 (comment), if the ServiceAccount/default-editor has permission to create WorkflowTaskResults, it WILL use them instead of the workflows.argoproj.io/outputs Pod annotation which Kubeflow expects.

@chensun
Member

chensun commented May 16, 2023

@thesuperzapper which distribution are you using? I don't see the issue with our latest GCP deployment, so I assume the "unwelcome" RBAC was not added in the base layer or in any of GCP's overrides of the manifest.

I looked up our code's dependency on the workflows.argoproj.io/outputs annotation, and it looks like it only applies to the v1 pipeline code path. I would strongly suggest you start trying out the v2 pipeline with our latest release (2.0.0-rc.1); we are targeting GA by next month.

@thesuperzapper
Member Author

@chauhang it's not about me specifically; it's just that many users are reporting this issue.

FYI, I imagine that most users who accidentally add this RBAC do so through role aggregation: any ClusterRole in the cluster with the rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true" label is automatically aggregated into ClusterRole/kubeflow-edit, which is bound to ServiceAccount/default-editor.
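As a concrete sketch of that aggregation path (the ClusterRole name is hypothetical; the label is the one quoted above):

```yaml
# Illustrative only: because of the aggregation label, this rule is
# automatically folded into ClusterRole/kubeflow-edit, and therefore
# into ServiceAccount/default-editor, without anyone editing
# kubeflow-edit directly.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: some-addon-role  # hypothetical name
  labels:
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["workflowtaskresults"]
    verbs: ["create"]
```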

If we are planning to fully drop support for KFP 1.0 (once 2.0 is out), and we can 100% confirm that KFP 2.0 will work when Argo Workflows uses WorkflowTaskResults for results, then it's probably fine to just warn users not to upgrade to Argo 3.4 (or assign the WorkflowTaskResults RBAC to default-editor).

Otherwise, if we plan to continue supporting KFP 1.0, we really must change the API so that both the workflows.argoproj.io/outputs annotation and WorkflowTaskResults work.

PS: a single cluster can use a mixture of the annotation and the CRD, because each workflow run decides which results mechanism to use based on whether it has the appropriate RBAC. So even KFP 2.0 MUST support both approaches.

@thesuperzapper
Member Author

@chensun @james-jwu @zijianjoy I believe Kubeflow Pipelines v2 (and v1) does NOT support Argo Workflows using WorkflowTaskResults; a search of the repo shows we never reference WorkflowTaskResult in any of our Go code.

We really must aim to support this ASAP, because even though WorkflowTaskResults is only used by default in Argo Workflows 3.4, in 3.3 (and earlier) if a Pod has WorkflowTaskResults permission it will use it instead of workflows.argoproj.io/outputs, which breaks Kubeflow Pipelines.

Also, others are reporting the same issue.

@chensun
Member

chensun commented Jun 15, 2023

@thesuperzapper We'll look into this when we have a chance. The issue is triaged as "awaiting contributors", so feel free to open pull requests if you like; we'll be happy to review.

At this moment, I'm not convinced this is worth prioritizing over everything else we've planned to work on, partly because we haven't heard similar issues from GCP customers (#8942 (comment)), and partly because in our previous attempt to upgrade to Argo 3.4 (#9301), our e2e tests all passed except for a frontend integration test. That test is still a blocking issue we must address, but it doesn't mean everything will break just because we don't implement KFP in a particular way that Argo supports.

@thesuperzapper
Member Author

@chensun I am skeptical that upgrading to Argo 3.4 will work properly because of both this issue and #8935.

Are you 100% sure everything is working properly in #9301?

Also, this issue CAN occur on any distribution (including GCP): all it takes is for someone to run a pipeline with a ServiceAccount that has permission to create WorkflowTaskResults, and then Argo will not create the workflows.argoproj.io/outputs Pod annotation for that run.
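To check which mechanism a given run used, something like the following works against a live cluster (namespace and pod names are placeholders):

```shell
# If this returns objects for a run, Argo used WorkflowTaskResults:
kubectl get workflowtaskresults.argoproj.io -n <namespace>

# If Argo used the legacy mechanism, the task Pod carries the annotation
# (empty output here means the annotation is missing or empty):
kubectl get pod <task-pod> -n <namespace> \
  -o jsonpath='{.metadata.annotations.workflows\.argoproj\.io/outputs}'
```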

@chensun
Member

chensun commented Jun 16, 2023

@chensun I am skeptical that upgrading to Argo 3.4 will work properly because of both this issue and #8935.

Are you 100% sure everything is working properly in #9301?

I believe you should have access to view the logs of the presubmit tests linked in #9301? Here's a snapshot:

STEP                                              TEMPLATE                        PODNAME                                                           DURATION  MESSAGE
 ✖ integration-test-65dvq                         integration-test                                                                                            child 'integration-test-65dvq-3156101307' failed  
 ├─┬─✔ build-api-integration-test-image(0)        build-image                     integration-test-65dvq-build-image-3916643689                     5m                                                          
 │ ├─✔ build-basic-e2e-tests-image(0)             build-image                     integration-test-65dvq-build-image-655170568                      4m                                                          
 │ ├─✔ build-frontend-integration-tests-image(0)  build-image                     integration-test-65dvq-build-image-923838798                      3m                                                          
 │ └─✔ build-initialization-test-image(0)         build-image                     integration-test-65dvq-build-image-788876506                      5m                                                          
 ├───✔ run-initialization-tests                   run-initialization-tests        integration-test-65dvq-run-initialization-tests-2988339744        2m                                                          
 ├───✔ run-initialization-tests-v2                run-initialization-tests-v2     integration-test-65dvq-run-initialization-tests-v2-1235315324     1m                                                          
 ├───✔ run-api-integration-tests                  run-api-integration-tests       integration-test-65dvq-run-api-integration-tests-2497752853       3m                                                          
 ├───✔ run-api-integration-tests-v2               run-api-integration-tests-v2    integration-test-65dvq-run-api-integration-tests-v2-589196377     2m                                                          
 └───✖ run-frontend-integration-tests             run-frontend-integration-tests  integration-test-65dvq-run-frontend-integration-tests-3156101307  2m        Error (exit code 1)                               
++ exit 1
[test-timing] It took 19m6.814s.

Not that everything was fine (the issue was also described in the linked issue from the PR), but yes, the overall workflow execution was fine. I also tested it manually in my cluster while debugging; you can sync to the PR and reproduce it yourself.
I didn't look closely at whether other configs might make Argo behave differently, but it did just work out of the box for the general scenarios.

@alexec

alexec commented Jul 11, 2023

WorkflowTaskResult is an internal CRD for Argo Workflows. I don’t think KFP should reference it directly as it may change.

@github-actions

github-actions bot commented Oct 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Oct 9, 2023
@thesuperzapper
Member Author

This issue is still present, and will prevent Kubeflow from being used with Argo Workflows 3.4+.

@stale stale bot removed the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Oct 9, 2023
@juliusvonkohout
Member

+1


github-actions bot commented Mar 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Mar 7, 2024
@thesuperzapper
Member Author

This is absolutely not stale.

@chensun, have we made any progress on getting Argo workflows 3.4+ working?

@rimolive
Member

@thesuperzapper Just FYI, we just merged the Argo 3.4 work in #10568, but we still need to cut a release to cover that change. Stay tuned for more updates soon!

@juliusvonkohout
Member

@thesuperzapper Just FYI, we just merged the Argo 3.4 work in #10568, but we still need to cut a release to cover that change. Stay tuned for more updates soon!

So did you cover the broken caching in the Argo 3.4 PR?

@rimolive
Member

We hope so, @gmfrasca ran a test with a build from master and it looks good. Here's the screenshot:

[screenshot attachment]

@thesuperzapper
Member Author

@rimolive what about a v1 pipeline (since we are meant to be backward compatible)?


github-actions bot commented Jul 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale The issue / pull request is stale, any activities remove this label. label Jul 2, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
