Replies: 2 comments
-
Hi @marcilj, did you manage to develop a solution for this approach? I feel like I am having similar issues.
-
I have the same issue. Is there any solution to this?
-
I don't understand how I can do the same operation over multiple elements, and it's bugging me so much.
Let's take the example in Dynamic Mapping & Collect.
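For reference, the shape of that docs pattern, adapted to the JSON-to-Parquet case described below, is roughly the following; `DynamicOut`, `DynamicOutput`, `.map()` and `.collect()` are the documented Dagster API, while the op bodies, names, and config here are placeholder assumptions:

```python
import os
from typing import List

from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut(), config_schema={"path": str})
def files_in_directory(context):
    # Hypothetical listing: a real version would list an S3 prefix
    # instead of a local directory.
    directory = context.op_config["path"]
    for filename in os.listdir(directory):
        yield DynamicOutput(
            value=os.path.join(directory, filename),
            # mapping_key has to be a valid identifier, hence the replaces
            mapping_key=filename.replace(".", "_").replace("-", "_"),
        )


@op
def process_file(path: str) -> str:
    # Hypothetical conversion: read the JSON file, write Parquet,
    # return the output path.
    return path.replace(".json", ".parquet")


@op
def summarize_directory(paths: List[str]) -> int:
    # Hypothetical summary step: record the full set of converted
    # files, here reduced to counting them.
    return len(paths)


@job
def process_directory():
    results = files_in_directory().map(process_file)
    summarize_directory(results.collect())
```

The catch described below sits exactly here: `summarize_directory` only runs once `collect()` has every mapped output, so a single failed `process_file` step blocks the summary.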
So I understand the `map` and `collect` part: you do the same operation with the `map` and you retrieve all the responses with the `collect`. The issue I have is how that can work in production.
In this example, `process_file()` converts a file from `json` to `parquet`, and `summarize_directory()` stores those files in a folder on S3. If the `process_file()` function fails for any reason, none of the files will be converted, because `summarize_directory` won't be able to collect all the parquet files to store them.

Another option would be to have `process_file()` convert a file from `json` to `parquet` and store it to S3 itself, but then the `collect` function isn't useful anymore.

If instead of using `DynamicOutput` we choose to trigger runs for each of the elements, we could do something like this.
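A rough sketch of what "trigger a run per element" could look like, using a sensor that emits one `RunRequest` per file; this is only an illustration under assumed names (the helper, op config, and sample keys are hypothetical), not necessarily what was originally posted:

```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"s3_key": str})
def process_file(context):
    # Hypothetical body: convert this one JSON key to Parquet on S3.
    context.log.info(f"processing {context.op_config['s3_key']}")


@job
def process_single_file():
    process_file()


def list_unprocessed_json_files():
    # Hypothetical helper: list the S3 prefix and keep only keys that
    # do not yet have a Parquet counterpart.
    return ["landing/a.json", "landing/b.json"]


@sensor(job=process_single_file)
def one_run_per_file_sensor():
    for key in list_unprocessed_json_files():
        yield RunRequest(
            run_key=key,  # de-duplicates: the same key is not launched twice
            run_config={"ops": {"process_file": {"config": {"s3_key": key}}}},
        )
```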
That solution doesn't seem appropriate for Dagster, because when I look at the maximum runs in Dagster Cloud Serverless it's capped at 50, and since it's your paid service, I expect that this isn't the way you expect the tool to be used.
TLDR;
Is there a way to execute an operation on multiple elements of the same nature while storing ALL of the successful elements, to avoid recomputing the same elements?
I'm probably missing something obvious, but this is always a big problem for me. So much so that here's a precise example (not mine, but one that fits the provided example).
Every day at noon, I want to look at all the files in a folder on S3 and convert each of the files returned from JSON to Parquet. I want to store the result in another bucket. Since there might be a lot of files to process (let's say 5,000, to exclude some options) and this task might take a while (5 hours), I need to store the successful results (Parquet files) even if some of the file conversions fail. I also want to be alerted if any of them fails, because if I'm not alerted they will never be fixed and I'll end up with incomplete data. I also need to be able to retry all the failed files once they are fixed or once my code is fixed. Obviously, if 4,999 files succeed and one fails, the retry should only reprocess 1 file. This retry process can be done manually using the Dagster UI.
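For the parts of this that map directly onto existing Dagster settings, a minimal sketch (op, job, and schedule names are hypothetical): a per-op `RetryPolicy` covers transient per-file failures, and a `ScheduleDefinition` with a noon cron covers the daily trigger. It does not by itself answer the "only re-run the failed files" part, which is the core of the question.

```python
from dagster import RetryPolicy, ScheduleDefinition, job, op


# Retry a flaky step up to 3 times, 60 seconds apart, before the step
# is marked failed.
@op(retry_policy=RetryPolicy(max_retries=3, delay=60))
def convert_one_file():
    # Hypothetical body: convert a single JSON file to Parquet.
    ...


@job
def convert_files():
    convert_one_file()


# "Every day at noon", expressed as a cron schedule on the job.
daily_noon = ScheduleDefinition(job=convert_files, cron_schedule="0 12 * * *")
```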