Replies: 2 comments
-
Hi @marcilj, did you manage to develop a solution for this approach? I feel like I am having similar issues.
-
I have the same issue. Is there any solution to this?
-
I don't understand how I can do the same operation over multiple elements, and it's bugging me so much.
Let's take the example in Dynamic Mapping & Collect.
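For reference, the shape of that docs pattern, adapted to the JSON-to-Parquet case described below, is roughly the following; `DynamicOut`, `DynamicOutput`, `.map()` and `.collect()` are the documented Dagster API, while the op bodies, names, and config here are placeholder assumptions:

```python
import os
from typing import List

from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut(), config_schema={"path": str})
def files_in_directory(context):
    # Hypothetical listing: a real version would list an S3 prefix
    # instead of a local directory.
    directory = context.op_config["path"]
    for filename in os.listdir(directory):
        yield DynamicOutput(
            value=os.path.join(directory, filename),
            # mapping_key has to be a valid identifier, hence the replaces
            mapping_key=filename.replace(".", "_").replace("-", "_"),
        )


@op
def process_file(path: str) -> str:
    # Hypothetical conversion: read the JSON file, write Parquet,
    # return the output path.
    return path.replace(".json", ".parquet")


@op
def summarize_directory(paths: List[str]) -> int:
    # Hypothetical summary step: record the full set of converted
    # files, here reduced to counting them.
    return len(paths)


@job
def process_directory():
    results = files_in_directory().map(process_file)
    summarize_directory(results.collect())
```

The catch described below sits exactly here: `summarize_directory` only runs once `collect()` has every mapped output, so a single failed `process_file` step blocks the summary.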
So I understand the `map` and `collect` part: you do the same operation with the `map` and you retrieve all the responses with the `collect`. The issue I have is how that can work in production.
In this example, `process_file()` converts a file from `json` to `parquet`, and `summarize_directory()` stores those files in a folder on S3. If the `process_file()` function fails for any reason, none of the files will be converted, because `summarize_directory` won't be able to collect all the parquet files to store them.

Another option would be to have `process_file()` convert a file from `json` to `parquet` and store it to S3 itself, but then the `collect` function isn't useful anymore.

If instead of using `DynamicOutput` we choose to trigger runs for each of the elements, we could do something like this.
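A rough sketch of what "trigger a run per element" could look like, using a sensor that emits one `RunRequest` per file; this is only an illustration under assumed names (the helper, op config, and sample keys are hypothetical), not necessarily what was originally posted:

```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"s3_key": str})
def process_file(context):
    # Hypothetical body: convert this one JSON key to Parquet on S3.
    context.log.info(f"processing {context.op_config['s3_key']}")


@job
def process_single_file():
    process_file()


def list_unprocessed_json_files():
    # Hypothetical helper: list the S3 prefix and keep only keys that
    # do not yet have a Parquet counterpart.
    return ["landing/a.json", "landing/b.json"]


@sensor(job=process_single_file)
def one_run_per_file_sensor():
    for key in list_unprocessed_json_files():
        yield RunRequest(
            run_key=key,  # de-duplicates: the same key is not launched twice
            run_config={"ops": {"process_file": {"config": {"s3_key": key}}}},
        )
```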
That solution doesn't seem appropriate for Dagster, because when I look at the maximum runs in Dagster Cloud Serverless it's capped at 50, and since it's your paid service, I expect that this isn't the way you expect the tool to be used.
TLDR;
Is there a way to execute an operation on multiple elements of the same nature while storing ALL of the successful elements, to avoid recomputing the same elements?
I'm probably missing something obvious, but this is always a big problem for me. So much so that here's a precise example (not mine, but one that fits the provided example).
Every day at noon, I want to look at all the files in a folder on S3 and convert each of the files returned from JSON to Parquet. I want to store the result in another bucket. Since there might be a lot of files to process (let's say 5,000, to exclude some options) and this task might take a while (5 hours), I need to store the successful results (Parquet files) even if some of the file conversions fail. I also want to be alerted if any of them fails, because if I'm not alerted they will never be fixed and I'll end up with incomplete data. I also need to be able to retry all the failed files once they are fixed or once my code is fixed. Obviously, if 4,999 files succeed and one fails, the retry should only reprocess 1 file. This retry process can be done manually using the Dagster UI.
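For the parts of this that map directly onto existing Dagster settings, a minimal sketch (op, job, and schedule names are hypothetical): a per-op `RetryPolicy` covers transient per-file failures, and a `ScheduleDefinition` with a noon cron covers the daily trigger. It does not by itself answer the "only re-run the failed files" part, which is the core of the question.

```python
from dagster import RetryPolicy, ScheduleDefinition, job, op


# Retry a flaky step up to 3 times, 60 seconds apart, before the step
# is marked failed.
@op(retry_policy=RetryPolicy(max_retries=3, delay=60))
def convert_one_file():
    # Hypothetical body: convert a single JSON file to Parquet.
    ...


@job
def convert_files():
    convert_one_file()


# "Every day at noon", expressed as a cron schedule on the job.
daily_noon = ScheduleDefinition(job=convert_files, cron_schedule="0 12 * * *")
```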