[backend] Artifacts of Components in ParallelFor / Sub-DAGs Overwritten by Concurrent Iterations #10186
Comments
@chensun we've been trying to figure this one out ourselves and have some ideas, but are also unclear on how to test them properly. The root of the problem is that the output path in V2 is named:
which is not unique across sub-dags. In V1 it was:
Which is unique because the pod ID is present. We are therefore trying to figure out whether it's possible to simply update the artifact path to include the pod name, as that should in theory fix our issue. Changing the return statement of the function generateOutputURI from
return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, artifactName))
to:
return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, component.EnvPodName, artifactName))
could fix the issue, as it would make the output URIs unique across sub-dags. The problem is, I believe this fix will require the
Could you assist us with the fix and/or advise us on how to properly build and incorporate the necessary images?
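For readers following along, the proposed change can be sketched in isolation. This is a minimal standalone sketch, not the actual KFP source: the function names currentOutputURI and proposedOutputURI, the example root, and the podName parameter are all hypothetical, and how the driver would actually obtain the pod name is an assumption left open here.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// currentOutputURI mirrors the return statement quoted above: the URI depends
// only on the pipeline root, the task name, and the artifact name.
func currentOutputURI(root, taskName, artifactName string) string {
	return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, artifactName))
}

// proposedOutputURI adds a per-pod path segment, as suggested above.
// podName is a hypothetical parameter standing in for the pod's name.
func proposedOutputURI(root, taskName, podName, artifactName string) string {
	return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, podName, artifactName))
}

func main() {
	root := "s3://bucket/pipeline/run-id/"

	// Two iterations of the same task currently map to the same URI.
	fmt.Println(currentOutputURI(root, "make-df", "output"))
	fmt.Println(currentOutputURI(root, "make-df", "output"))

	// With a per-pod segment, the URIs differ between iterations.
	fmt.Println(proposedOutputURI(root, "make-df", "pod-a", "output"))
	fmt.Println(proposedOutputURI(root, "make-df", "pod-b", "output"))
}
```

The sketch passes the pod name in as a plain string so that the uniqueness argument is visible without any Kubernetes or KFP dependency.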
Following up on the comment above, we have the following questions about building the backend, should we go ahead and do so. Any input on this would help:
Thank you @TristanGreathouse and @sachdevayash1910 for digging into the issue. The proposed fix looks good to me.
For driver/launcher changes, the minimum build includes the driver/launcher image builds and the apiserver image build, because the driver/launcher images are currently hard-coded in the apiserver. So the steps are:
For your test build, you can use this makefile to build the driver and launcher images. Alternatively, you can use their Dockerfiles (Dockerfile.driver & Dockerfile.launcher), which take much longer because they involve a licensing check and notice file generation.
Yes, only one place: pipelines/backend/src/v2/compiler/argocompiler/argo.go Lines 119 to 120 in d9c3b86
No, you don't. The cache server and cache deployer are for the v1 caching feature only. Your change only applies to v2. The driver is where we handle caching for v2 pipelines.
metadata-writer is also for v1 only, so you don't need it.
No. This is irrelevant to your change.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
Hey, we have also started to hit this issue using ParallelFor. Could we reopen this issue? I also submitted a simple PR (#11243) which I think will fix the issue (tested on our cluster). It is very similar to the suggested solution: it adds a random string to the output URI path.
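The random-string idea can be illustrated with a short sketch. This is not the code from PR #11243; the helper names randomSegment and uniqueOutputURI, the 8-byte length, and the hex encoding are assumptions chosen only to show the approach.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"path"
	"strings"
)

// randomSegment returns 2*n hex characters read from the OS random source.
func randomSegment(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// uniqueOutputURI inserts a random segment between the task name and the
// artifact name so that repeated executions of one task do not collide.
func uniqueOutputURI(root, taskName, artifactName string) (string, error) {
	seg, err := randomSegment(8) // hypothetical length: 8 bytes, 16 hex chars
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(taskName, seg, artifactName)), nil
}

func main() {
	for i := 0; i < 2; i++ {
		uri, err := uniqueOutputURI("s3://bucket/run", "task", "out")
		if err != nil {
			panic(err)
		}
		fmt.Println(uri) // a different URI on each call
	}
}
```

Compared with a pod-name segment, a random segment needs no information from the environment, at the cost of paths that cannot be traced back to a specific pod by name.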
…10186 (#11243) Signed-off-by: b4sus <[email protected]>
Environment
How do you deploy Kubeflow Pipelines (KFP)?
Using AWS.
KFP version:
1.8-rc5
KFP SDK version:
Steps to reproduce
The problem occurs when we run the same component across ParallelFor iterations OR sub-dags. I discovered the error after observing unexpected behavior when running nested ParallelFor loops, although it originates from the sub-dag structure itself and is not specific to ParallelFor loops.
What happens is that each component overwrites the artifacts of all other components created from the same component template / function because the auto-generated path for S3 is the same across ParallelFor iterations / sub-dags. The path is s3://{bucket}/{pipeline_name}/{run_id}/{component_name}/{output_path_name}. This means that if you run the same component for each iteration of the ParallelFor or across duplicate sub-dags, it overwrites itself in S3 (at least that's my theory).
The deceptive part of this is that there is no explicit failure, but rather unexpected results originating from this overwriting of artifacts. Some examples of how this can happen are as follows. My general use-case to test this is simply creating a dataframe with a column filled with a passed string value. You can then see in the input/output artifact preview tab if the artifact has the correct column. For example, in the screenshot below, the value passed for this particular iteration of a ParallelFor is b, but the value in the displayed output artifact is a.
SUB-DAGS
The compiler to recreate the problem in sub-dags is as follows:
ParallelFor
The compiler to create the problem in ParallelFor is as follows.
Duplicate Components Same Level
I did test and confirm that duplicating the same component in the same-level of a DAG did not run into this issue as the auto-generated output paths were unique. The compiler for this test is here:
This issue essentially makes each iteration of a ParallelFor / sub-dag non-unique and not dependent on the specific inputs provided, and makes the artifacts generated from it non-deterministic. I believe it may be related to the fix introduced in this PR by @chensun from last week.
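The silent overwrite described in this section can be reproduced without a cluster. The sketch below is only a simulation under stated assumptions: an in-memory map stands in for the S3 bucket, the component name make-dataframe and output name df are made up, and iterations run sequentially, so the last writer is predictable here even though it would not be under real concurrency.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// outputURI follows the path scheme described in this issue:
// {root}/{component_name}/{output_path_name}, with no per-iteration segment.
func outputURI(root, componentName, outputName string) string {
	return fmt.Sprintf("%s/%s", strings.TrimRight(root, "/"), path.Join(componentName, outputName))
}

// runIterations pretends that each loop item writes its value as an artifact.
// The returned map is an in-memory stand-in for the bucket, keyed by URI.
func runIterations(root string, items []string) map[string]string {
	store := make(map[string]string)
	for _, item := range items {
		// Every iteration computes the same key, so earlier writes are lost
		// without any error being raised.
		store[outputURI(root, "make-dataframe", "df")] = item
	}
	return store
}

func main() {
	store := runIterations("s3://bucket/pipeline/run-id", []string{"a", "b", "c"})
	fmt.Println(len(store)) // one stored object although three iterations wrote
	for uri, value := range store {
		fmt.Println(uri, value)
	}
}
```

Any downstream step that reads the artifact by URI then sees whichever value was written last, which matches the mismatch between the passed value and the previewed artifact reported above.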
Expected result
The ParallelFor / sub-dag structures should not overwrite other components executed in concurrent ParallelFor iterations / sub-dags. We should be able to refer to the output artifact of a component within an iteration of a ParallelFor or sub-dag and retrieve the artifact that was generated within the same iteration (rather than a parallel one that has different args and thus different outputs).
Impacted by this bug? Give it a 👍.