[backend] caching fails when task pods have WorkflowTaskResults RBAC (argo 3.4 change) #8942
Comments
Some discussion in Slack here.
@connor-mccarthy @chensun just wondering where we are on this issue? I am seeing more users hit this issue, even on Argo 3.3 (like in our Kubeflow 1.7 manifests). Note, it happens when the task Pods' ServiceAccount has the RBAC to create WorkflowTaskResults. Because workflows that are created by Kubeflow Pipelines use the […], we urgently need to make Kubeflow Pipelines aware of WorkflowTaskResults.
@chensun even if we don't upgrade our Argo to 3.4, we really need to make Kubeflow Pipelines aware of WorkflowTaskResults.
@thesuperzapper which distribution are you using? I don't see the issue with our latest GCP deployment, so I assume the "unwelcome" RBAC was not added in the base layer or any GCP override of the manifest. I looked up our code dependency on […].
@chauhang it's not about me specifically, it's just that many users are reporting this issue. FYI, I imagine that most users who are accidentally adding this RBAC are doing so through role aggregation, that is, any role in the cluster with the […] aggregation label.

If we are planning to fully drop support for KFP 1.0 (once 2.0 is out), and we can 100% confirm that KFP 2.0 will support Argo Workflows using WorkflowTaskResults, […]. Otherwise, if we plan to continue supporting KFP 1.0, we really must change the API so that both the annotation and the WorkflowTaskResults CRD approaches are supported.

PS: a single cluster can use a mixture of the annotation and CRD approaches; this is because each specific workflow run decides which results type to use based on whether it has the appropriate RBAC. So even KFP 2.0 MUST support both approaches.
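To illustrate the per-run decision described above, here is a minimal sketch (not KFP code) using the Python `kubernetes` client of the RBAC check that determines whether a workflow's ServiceAccount can create `WorkflowTaskResults`, which is the same permission Argo uses to decide between the CRD path and the Pod-annotation path. The namespace and ServiceAccount names below are placeholders.

```python
# Sketch: check whether a ServiceAccount is allowed to create WorkflowTaskResults,
# i.e. whether Argo will use the CRD path or fall back to Pod annotations.
# Namespace/ServiceAccount names are placeholders, not assumptions about any deployment.
from kubernetes import client, config

def can_create_task_results(namespace: str, service_account: str) -> bool:
    config.load_kube_config()  # or config.load_incluster_config() inside a Pod
    review = client.V1SubjectAccessReview(
        spec=client.V1SubjectAccessReviewSpec(
            user=f"system:serviceaccount:{namespace}:{service_account}",
            resource_attributes=client.V1ResourceAttributes(
                group="argoproj.io",
                resource="workflowtaskresults",
                verb="create",
                namespace=namespace,
            ),
        )
    )
    response = client.AuthorizationV1Api().create_subject_access_review(review)
    return bool(response.status.allowed)

if __name__ == "__main__":
    print(can_create_task_results("kubeflow", "pipeline-runner"))  # placeholder names
```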
@chensun @james-jwu @zijianjoy I believe Kubeflow Pipelines v2 (and v1) does NOT support Argo Workflows using WorkflowTaskResults. We really must aim to support this ASAP, because even though […]. Also, others are reporting the same issue: […]
@thesuperzapper We'll look into this when we have a chance. The issue is triaged as "awaits contributors", so feel free to open pull requests if you like; we'll be happy to review. At this moment, I'm not convinced this is something worth prioritizing over everything else we've planned to work on, partially because we haven't heard similar issues from GCP customers (#8942 (comment)), and also because in our previous attempt to upgrade to Argo 3.4 (#9301), our e2e tests all passed except for a frontend integration test--it is still a blocking issue we must address, but it's not that everything will break because we don't implement KFP in a certain way that Argo supports.
@chensun I am skeptical that upgrading to Argo 3.4 will work properly because of both this issue and #8935. Are you 100% sure everything is working properly in #9301? Also, this issue CAN occur on any distribution (including GCP); all it takes is for someone to run a pipeline with a service account that has permission to create WorkflowTaskResults.
I believe you should have access to view the logs of the presubmit tests linked in #9301? Here's the snapshot.
Not that everything was fine (the issue was also described in the linked issue from the PR), but yes, the overall workflow execution was fine. I also tested it manually in my cluster while debugging. You can sync to the PR and reproduce it yourself.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue is still present, and will prevent Kubeflow from being used with Argo Workflows 3.4+.
+1
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is absolutely not stale. @chensun, have we made any progress on getting Argo Workflows 3.4+ working?
@thesuperzapper Just FYI, we just merged the Argo 3.4 work in #10568, but we still need to cut the release to cover that new change. Stay tuned for more updates soon!
So did you cover the broken caching in the Argo 3.4 PR?
We hope so; @gmfrasca ran a test with a build from master and it looks good. Here's the screenshot:
@rimolive what about a V1 pipeline (as we are meant to be backward compatible)?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Issue

When the `ServiceAccount` used by `argo-workflows` task Pods has the RBAC to create `WorkflowTaskResults` resources, argo-workflows changes how it stores task outputs and stores them in these CRDs (previously it patched the Pod to store them in annotations).

Kubeflow Pipelines does not know about the `WorkflowTaskResults` CRD, and fails in a strange way. Specifically, workflows will successfully run the FIRST time they are run (when there is no cache), but every following time they are run, all tasks will fail with the message `This step is in Error state with this message: unexpected end of JSON input`.

Here is an issue from someone with this problem: #8842
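As an illustration of what "making Kubeflow Pipelines aware of the CRD" could involve, here is a minimal sketch using the Python `kubernetes` client that lists the `WorkflowTaskResults` created for a workflow and reads their reported outputs. The group/version/plural and the `workflows.argoproj.io/workflow` label come from Argo Workflows; the namespace and workflow name are placeholders, and this is not actual KFP backend code.

```python
# Sketch: list WorkflowTaskResult objects for a given workflow and print task outputs.
# Assumes Argo's group/version/plural (argoproj.io/v1alpha1, workflowtaskresults)
# and the "workflows.argoproj.io/workflow" label; placeholder names throughout.
from kubernetes import client, config

def get_task_results(namespace: str, workflow_name: str):
    config.load_kube_config()  # or config.load_incluster_config() inside a Pod
    api = client.CustomObjectsApi()
    results = api.list_namespaced_custom_object(
        group="argoproj.io",
        version="v1alpha1",
        namespace=namespace,
        plural="workflowtaskresults",
        label_selector=f"workflows.argoproj.io/workflow={workflow_name}",
    )
    for item in results.get("items", []):
        name = item["metadata"]["name"]      # typically matches the task Pod name
        outputs = item.get("outputs", {})    # parameters/artifacts reported for the task
        print(name, outputs.get("parameters", []))

if __name__ == "__main__":
    get_task_results("kubeflow", "my-pipeline-run-abc12")  # placeholder names
```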
I expect we will see MANY more of this issue soon, as `argo-workflows` 3.4+ uses `WorkflowTaskResults` by default, and people will try to update from our packaged 3.3 version.

Screenshot:
Impacted by this bug? Give it a 👍.