Assign backfills a run status based on their sub-run statuses #23702

jamiedemaria · 2024-08-16T15:05:50Z

Summary & Motivation

Computes the DagsterRunStatus for a backfill based on the statuses of the sub-runs and the BulkAction status of the backfill, rather than just mapping BulkActionStatus -> DagsterRunStatus in a one-to-one fashion.

We might want to store the run status in the DB at some point to facilitate filtering, but as a first step I'm just adding it to the GQL layer where it will be faster to iterate on how sub-run statuses inform overall status and won't potentially require migrating old data if we change how we determine status.
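For readers skimming the thread, the aggregation described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: both enums are local stand-ins (a subset, not imported from Dagster), `compute_backfill_run_status` is a hypothetical name, and only two rules come from the diff snippets quoted in this thread (any CANCELED sub-run yields FAILURE; a REQUESTED backfill with no runs yields QUEUED). Every other branch, and the order of the checks, is a guess.

```python
from enum import Enum
from typing import Sequence


class BulkActionStatus(Enum):
    # Local stand-in for the backfill-level status (subset only).
    REQUESTED = "REQUESTED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


class RunStatus(Enum):
    # Local stand-in for the run-level status (subset only).
    QUEUED = "QUEUED"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    CANCELED = "CANCELED"


def compute_backfill_run_status(
    backfill_status: BulkActionStatus,
    sub_run_statuses: Sequence[RunStatus],
) -> RunStatus:
    # From the diff: any canceled sub-run marks the backfill as FAILURE.
    if any(s is RunStatus.CANCELED for s in sub_run_statuses):
        return RunStatus.FAILURE
    # Assumption: a failed backfill or any failed sub-run is also FAILURE.
    if backfill_status is BulkActionStatus.FAILED or any(
        s is RunStatus.FAILURE for s in sub_run_statuses
    ):
        return RunStatus.FAILURE
    if backfill_status is BulkActionStatus.REQUESTED:
        # From the diff: no runs launched yet means QUEUED.
        if len(sub_run_statuses) == 0:
            return RunStatus.QUEUED
        # Assumption: otherwise the backfill is still in progress.
        return RunStatus.STARTED
    # Assumption: a canceled backfill with no failed runs shows as CANCELED.
    if backfill_status is BulkActionStatus.CANCELED:
        return RunStatus.CANCELED
    # Assumption: COMPLETED is SUCCESS only when every sub-run succeeded.
    if all(s is RunStatus.SUCCESS for s in sub_run_statuses):
        return RunStatus.SUCCESS
    return RunStatus.STARTED
```

Because the function is pure, the mapping from sub-run statuses to overall status can be changed without touching stored data, which is the iteration benefit mentioned above.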

How I Tested These Changes

Added assertions on run status to existing tests.

jamiedemaria · 2024-08-19T19:14:20Z

python_modules/dagster-graphql/dagster_graphql/schema/backfill.py

+        if any(status == "CANCELED" for status in sub_run_statuses):
+            return GrapheneRunStatus.FAILURE
+
+        # can't import this because two deserializers get registered for PipelineRunStatsSnapshot


I'd rather use the code below to check the statuses of the runs, but if I import DagsterRunStatus I get the error dagster._serdes.errors.SerdesUsageError: Multiple deserializers registered for storage name 'PipelineRunStatsSnapshot'. I think this is because the storage_name for DagsterRunStatsSnapshot

@whitelist_for_serdes(storage_name="PipelineRunStatsSnapshot")
class DagsterRunStatsSnapshot( ...

collides with GraphenePipelineRunStatsSnapshot.

Not sure what there is to do about this other than move DagsterRunStatus into a separate module.

Not sure what there is to do about this other than move DagsterRunStatus into a separate module.

Surprised that that would help with the serdes error, but moving it to a separate module seems like a nice thing to me.

sryza

We might want to store the run status in the DB at some point to facilitate filtering

Curious to hear @prha's thoughts, but I do think we should prioritize storing this in the DB. Backfills can have lots of runs, so computing these at read time could make loading the runs page much slower.

prha

What is this run status used for? I think the run status coercion is a bit odd, to be honest.

Is it because we didn't want to use a different type in the frontend?

prha · 2024-08-22T02:49:14Z

We might want to store the run status in the DB at some point to facilitate filtering

Curious to hear @prha's thoughts, but I do think we should prioritize storing this in the DB. Backfills can have lots of runs, so computing these at read time could make loading the runs page much slower.

That's true... if we wanted to store this in the DB, we'd have to add an aggregate run status column (or something like that)... we'd probably update it in the event log consumer daemon, as runs complete.

prha · 2024-08-22T02:50:17Z

I mostly don't like calling it runStatus because the thing is not really a run.

jamiedemaria · 2024-08-22T14:12:12Z

What is this run status used for? I think the run status coercion is a bit odd, to be honest.
Is it because we didn't want to use a different type in the frontend?

Pretty much. The reasoning is that if we have a consolidated list of runs and backfills, we want to display consistent status info about each entry. DagsterRunStatus has more granular information, and is the status that would be used if backfills were executed as single runs, so that's what got picked for the single status enum. The name runStatus is very changeable though.

prha · 2024-08-22T15:39:53Z

I think we should maybe rename to something else... maybe aggregateRunStatus or groupedRunStatus or something like that.

Doesn't have to be this diff, but I think we should also consider having a separate column in the DB on the bulk actions table for it, just to minimize data reads on runs page loads.

Just like we update the runs row status column on store event calls, we could update the bulk_action aggregate_run_status column.
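To make the write-time idea concrete, here is a rough sketch using an in-memory SQLite database. Everything in it is hypothetical: the table layout, the `store_run_status` and `read_aggregate` helpers, and the simple rollup rule are illustrative assumptions and do not reflect Dagster's actual schema or storage API.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical, simplified schema for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE runs (
            run_id TEXT PRIMARY KEY, backfill_id TEXT, status TEXT
        );
        CREATE TABLE bulk_actions (
            backfill_id TEXT PRIMARY KEY, aggregate_run_status TEXT
        );
        """
    )
    return conn


def store_run_status(
    conn: sqlite3.Connection, run_id: str, backfill_id: str, status: str
) -> None:
    # Update the run row, then refresh the backfill's aggregate column
    # in the same write path so page loads can read a single column.
    conn.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?)",
        (run_id, backfill_id, status),
    )
    rows = conn.execute(
        "SELECT status FROM runs WHERE backfill_id = ?", (backfill_id,)
    ).fetchall()
    statuses = [row[0] for row in rows]
    # Assumed rollup rule: any failure wins, all-success is SUCCESS,
    # anything else is still in progress.
    if any(s in ("FAILURE", "CANCELED") for s in statuses):
        aggregate = "FAILURE"
    elif all(s == "SUCCESS" for s in statuses):
        aggregate = "SUCCESS"
    else:
        aggregate = "STARTED"
    conn.execute(
        "INSERT OR REPLACE INTO bulk_actions VALUES (?, ?)",
        (backfill_id, aggregate),
    )
    conn.commit()


def read_aggregate(conn: sqlite3.Connection, backfill_id: str) -> str:
    row = conn.execute(
        "SELECT aggregate_run_status FROM bulk_actions WHERE backfill_id = ?",
        (backfill_id,),
    ).fetchone()
    return row[0]
```

This sketch re-reads every run of the backfill on each write; a real implementation would probably update the aggregate incrementally to avoid that cost for large backfills.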

sryza · 2024-08-22T15:43:23Z

Is there a world where we would consider replacing the bulk action status column with these statuses?

The existing bulk action statuses for an asset backfill are already largely based on the aggregate status of runs within the backfill.

prha · 2024-08-22T15:50:37Z

I think it depends on what actions we need to take based on the status.

There's an external status, which is what we would show to the user. I think that's fine to completely shift to this new aggregate status. But I think there may be some configurable policies in terms of what we internally need to kick off, w.r.t launching runs, retries, etc. It might still be useful to have an internal-only concept of status/state of the backfill.

prha · 2024-08-22T15:51:36Z

I think we do ourselves a disservice by calling everything status.

is displayable status
is execution state

sryza · 2024-08-22T16:43:25Z

To help myself get a stronger grip on what we're talking about here, is the main issue with the current set of statuses that it doesn't allow us to distinguish between these two different outcomes?

This backfill submitted all the runs it was initially intended to, and all of those runs succeeded
This backfill submitted all the runs it was initially intended to, but some of those runs failed

The separation you're talking about makes sense @prha , but I think I'm reacting negatively to the name aggregate_run_status because it seems like kind of an implementation detail. In my mind what we're missing is more like "did this backfill achieve what it set out to achieve?"

prha · 2024-08-22T17:27:56Z

I see that we're effectively querying "status" for 2 different purposes.

From the daemon to figure out if we need to do more work, where work is something like launching a new run.
From the UI to report on the progress of the overall body of work.

I don't really care what we call them, but I want to make sure we're not coalescing two things that should actually stay separate.

Separately, I prefer that we don't call #2 the same thing for both runs and backfills, because they represent two separate things. It feels like a liability for a future bug where we assume that, based on this value, we can infer which actions can be taken on the object. I don't have a strong attachment to the specific naming of it though.
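One way to picture the separation being argued for here is to keep two distinct fields, one read by the daemon and one read by the UI. The sketch below is purely illustrative: `ExecutionState`, `DisplayStatus`, `BackfillView`, and both helper functions are hypothetical names, not existing Dagster types.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionState(Enum):
    # Hypothetical: what the daemon consults to decide on more work.
    REQUESTED = "REQUESTED"
    CANCELING = "CANCELING"
    DONE = "DONE"


class DisplayStatus(Enum):
    # Hypothetical: what the UI reports about the overall body of work.
    IN_PROGRESS = "IN_PROGRESS"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass(frozen=True)
class BackfillView:
    execution_state: ExecutionState
    display_status: DisplayStatus


def daemon_should_iterate(backfill: BackfillView) -> bool:
    # Purpose 1: the daemon only looks at execution state.
    return backfill.execution_state is not ExecutionState.DONE


def ui_label(backfill: BackfillView) -> str:
    # Purpose 2: the UI only looks at display status.
    return backfill.display_status.value
```

With distinct types, code that decides which actions to take cannot accidentally branch on the value meant for display.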

jamiedemaria · 2024-08-23T14:03:21Z

I did look into converting to storing DagsterRunStatuses a while ago and it's a bit cumbersome since it requires an OSS db migration and maintaining backcompat code, but certainly doable. I do agree with Phil on keeping the enum that communicates status to the daemon separate from the display status, though.

We could lean into the aggregated status being only about communicating externally and call it something like displayStatus.

sryza · 2024-08-23T14:59:13Z

I see that we're effectively querying "status" for 2 different purposes.

Thinking about this a little bit more, one thing I could imagine is wanting to add functionality that automatically retries failed backfills (failed in the sense of submitted all runs but some failed). Yesterday I was chatting with a customer who wanted this. So I think this is more than just a cosmetic status.

From the other direction, the backfill daemon has both COMPLETED and FAILED statuses. Both of these mean "backfill daemon doesn't need to do more work", but they're valuable for reporting purposes.

Also, the run statuses on runs are there for a mix of operational and reporting purposes. E.g. the difference between FAILURE and CANCELED is mainly for reporting purposes, but QUEUED has operational value.

This makes me wonder whether we should separate out the statuses.

A third option to consider here would be to add a column called something like "completed_outcome", which is SUCCESS or FAILURE if the bulk action status is COMPLETED.
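As a rough sketch of that third option: the outcome would be defined only once the bulk action status is COMPLETED. The function name, the use of plain strings, and the rule that any non-SUCCESS sub-run means FAILURE are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def completed_outcome(
    bulk_action_status: str, sub_run_statuses: Sequence[str]
) -> Optional[str]:
    # The outcome is only meaningful once the backfill has completed.
    if bulk_action_status != "COMPLETED":
        return None
    # Assumption: the backfill succeeded only if every sub-run succeeded.
    if all(status == "SUCCESS" for status in sub_run_statuses):
        return "SUCCESS"
    return "FAILURE"
```

Keeping this as a separate nullable value leaves the existing bulk action status untouched for the daemon.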

sryza · 2024-08-23T15:14:35Z

it requires an OSS db migration and maintaining backcompat code, but certainly doable

Oo yeah, backcompat is definitely something we need to think through if we don't want to compute the aggregate run status at read time. I think the easiest thing would be to say that backfills that completed prior to this change won't show up when someone filters for "run status=SUCCESS" or "run status=FAILURE" on the runs page. My strong suspicion is that people mostly filter on these statuses to monitor recent runs, so omitting some historical backfills is likely not a big deal.

jamiedemaria · 2024-09-10T16:15:28Z

python_modules/dagster-graphql/dagster_graphql/schema/backfill.py

+        if converted_status is BulkActionStatus.REQUESTED:
+            # if no runs have been launched:
+            if len(self._get_records(_graphene_info)) == 0:
+                return GrapheneRunStatus.QUEUED


Should this be NOT_STARTED? STARTING? Or should we just map everything in the REQUESTED state to STARTED?

I think STARTING might make sense here, but if it's STARTED, it's not a big deal.

jamiedemaria · 2024-09-11T19:38:15Z

closing in favor of #24365

jamiedemaria force-pushed the jamie/dagster-run-status-for-backfill branch from 8edf322 to a9049d3 Compare August 19, 2024 19:00

jamiedemaria commented Aug 19, 2024

jamiedemaria force-pushed the jamie/dagster-run-status-for-backfill branch from a9049d3 to e825e0d Compare August 20, 2024 14:29

jamiedemaria marked this pull request as ready for review August 20, 2024 16:53

jamiedemaria requested review from sryza and prha August 20, 2024 16:53

jamiedemaria added 5 commits August 20, 2024 13:18

Assign backfills a run status based on their sub-run statuses

0bd9318

tests

970782b

change return type for pyright

2bcfd43

not started to queued

2ed0dab

comment

b088121

jamiedemaria force-pushed the jamie/dagster-run-status-for-backfill branch from 4c99718 to b088121 Compare August 20, 2024 17:18

sryza reviewed Aug 20, 2024

prha reviewed Aug 22, 2024

jamiedemaria commented Sep 10, 2024

jamiedemaria closed this Sep 11, 2024
