Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assign backfills a run status based on their sub-run statuses #23702

Closed
wants to merge 5 commits into from

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Aug 16, 2024

Summary & Motivation

Computes the DagsterRunStatus for a backfill based on the statuses of the sub-runs and the BulkAction status of the backfill, rather than just mapping BulkActionStatus -> DagsterRunStatus in a one-to-one fashion.

We might want to store the run status in the DB at some point to facilitate filtering, but as a first step I'm just adding it to the GQL layer where it will be faster to iterate on how sub-run statuses inform overall status and won't potentially require migrating old data if we change how we determine status.

How I Tested These Changes

added assertions on run status in existing tests

Copy link
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @jamiedemaria and the rest of your teammates on Graphite Graphite

@jamiedemaria jamiedemaria force-pushed the jamie/dagster-run-status-for-backfill branch from 8edf322 to a9049d3 Compare August 19, 2024 19:00
if any(status == "CANCELED" for status in sub_run_statuses):
return GrapheneRunStatus.FAILURE

# can't import this because two deserializers get registered for PipelineRunStatsSnapshot
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use the code below to check the statuses of the runs, but if I import DagsterRunStatus I get the error dagster._serdes.errors.SerdesUsageError: Multiple deserializers registered for storage name 'PipelineRunStatsSnapshot' I think because the storage_nameforDagsterRunStatsSnapshot`

@whitelist_for_serdes(storage_name="PipelineRunStatsSnapshot")
class DagsterRunStatsSnapshot(
...

collides with GraphenePipelineRunStatsSnapshot

Not sure what there is to de about this other than move DagsterRunStatus into a separate module

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what there is to de about this other than move DagsterRunStatus into a separate module

Surprised that that would help with the serdes error, but moving it to a separate module seems like a nice thing to me.

@jamiedemaria jamiedemaria force-pushed the jamie/dagster-run-status-for-backfill branch from a9049d3 to e825e0d Compare August 20, 2024 14:29
@jamiedemaria jamiedemaria marked this pull request as ready for review August 20, 2024 16:53
@jamiedemaria jamiedemaria requested review from sryza and prha August 20, 2024 16:53
@jamiedemaria jamiedemaria force-pushed the jamie/dagster-run-status-for-backfill branch from 4c99718 to b088121 Compare August 20, 2024 17:18
Copy link
Contributor

@sryza sryza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to store the run status in the DB at some point to facilitate filtering

Curious to hear @prha's thoughts, but I do think we should prioritize storing this in the DB. Backfills can have lots of runs, so computing these at read time could make loading the runs page much slower.

Copy link
Member

@prha prha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this run status used for? I think the run status coercion is a bit odd, to be honest.

Is it because we didn't want to use a different type in the frontend?

@prha
Copy link
Member

prha commented Aug 22, 2024

We might want to store the run status in the DB at some point to facilitate filtering

Curious to hear @prha's thoughts, but I do think we should prioritize storing this in the DB. Backfills can have lots of runs, so computing these at read time could make loading the runs page much slower.

That's true... if we wanted to store this in the DB, we'd have to add an aggregate run status column (or something like that)... we'd probably update it in the event log consumer daemon, as runs complete.

@prha
Copy link
Member

prha commented Aug 22, 2024

I mostly don't like calling it runStatus because the thing is not really a run.

@jamiedemaria
Copy link
Contributor Author

What is this run status used for? I think the run status coercion is a bit odd, to be honest.
Is it because we didn't want to use a different type in the frontend?

Pretty much. The reasoning is that if we have a consolidated list of runs and backfills, we want to display consistent status info about each entry. DagsterRunStatus has more granular information, and is the status that would be used if backfills were executed as single runs, so that's what got picked for the single status enum. The name runStatus is very changeable though.

@prha
Copy link
Member

prha commented Aug 22, 2024

I think we should maybe rename to something else... maybe aggregateRunStatus or groupedRunStatus or something like that.

Doesn't have to be this diff, but I think we should also consider having a separate column in the DB on the bulk actions table for it, just to minimize data reads on runs page loads.

Just like we update the runs row status column on store event calls, we could update the bulk_action aggregate_run_status column.

@sryza
Copy link
Contributor

sryza commented Aug 22, 2024

Is there a world where we would consider replacing the bulk action status column with these statuses?

The existing bulk action statuses for an asset backfill are already largely based on the aggregate status of runs within the backfill.

@prha
Copy link
Member

prha commented Aug 22, 2024

I think it depends on what actions we need to take based on the status.

There's an external status, which is what we would show to the user. I think that's fine to completely shift to this new aggregate status. But I think there may be some configurable policies in terms of what we internally need to kick off, w.r.t launching runs, retries, etc. It might still be useful to have an internal-only concept of status/state of the backfill.

@prha
Copy link
Member

prha commented Aug 22, 2024

I think we do ourselves a disservice by calling everything status.

  1. is displayable status
  2. is execution state

@sryza
Copy link
Contributor

sryza commented Aug 22, 2024

To help myself get a stronger grip on what we're talking about here, is the main issue with the current set of statuses that it doesn't allow us to distinguish between these two different outcomes?

  • This backfill submitted all the runs it was initially intended to, and all of those runs succeeded
  • This backfill submitted all the runs it was initially intended to, but some of those runs failed

The separation you're talking about makes sense @prha , but I think I'm reacting negatively to the name aggregate_run_status because it seems like kind of an implementation detail. In my mind what we're missing is more like "did this backfill achieve what it set out to achieve?"

@prha
Copy link
Member

prha commented Aug 22, 2024

I see that we're effectively querying "status" for 2 different purposes.

  1. From the daemon to figure out if we need to do more work, where work is something like launching a new run.
  2. From the UI to report on the progress of the overall body of work.

I don't really care what we call them, but I want to make sure we're not coalescing two things that should actually stay separate.

Separately, I prefer that we don't call #2 the same thing for both runs and backfills, because they represent two separate things. It feels like a liability for a future bug where we think we can make inferences based on this value the actions that we can take on the object. I don't have a strong attachment to the specific naming of it though.

@jamiedemaria
Copy link
Contributor Author

jamiedemaria commented Aug 23, 2024

I did look in to converting to storing DagsterRunStatuses a while ago and it's a bit cumbersome since it requires an OSS db migration and maintaining backcompat code, but certainly doable. I do agree with phil on keeping the enum to communicate status to the daemon separate from the display status, though.

We could lean in to the aggregated status being only about communicating externally and call it something like displayStatus

@sryza
Copy link
Contributor

sryza commented Aug 23, 2024

I see that we're effectively querying "status" for 2 different purposes.

Thinking about this a little bit more, one thing I could imagine is wanting to add functionality that automatically retries failed backfills (failed in the sense of submitted all runs but some failed). Yesterday I was chatting with a customer who wanted this. So I think this is more than just a cosmetic status.

From the other direction, the backfill daemon has both COMPLETED and FAILED statuses. Both of these mean "backfill daemon doesn't need to do more work", but they're valuable for reporting purposes.

Also, the run statuses on runs are there for a mix of operational and reporting purposes. E.g. the difference between FAILURE and CANCELED is mainly for reporting purposes, but QUEUED has operational value.

This makes me wonder whether we should separate out the statuses.

A third option to consider here would be to add a column called something like "completed_outcome", which is SUCCESS or FAILURE if the bulk action status is COMPLETED.

@sryza
Copy link
Contributor

sryza commented Aug 23, 2024

it requires an OSS db migration and maintaining backcompat code, but certainly doable

Oo yeah backcompat definitely something we need to think through if we don't want to compute the aggregate run status at read time. I think the easiest thing would be to say that backfills that completed prior to this change won't show up when someone filters for "run status=SUCCESS" or "run status=FAILURE" on the runs page. My strong suspicion is that people mostly filter on these statuses to monitor recent runs, so omitting some historical backfills is likely not a big deal.

if converted_status is BulkActionStatus.REQUESTED:
# if no runs have been launched:
if len(self._get_records(_graphene_info)) == 0:
return GrapheneRunStatus.QUEUED
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be not_started? starting? or just map to Started for all in requested state

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think STARTING might make sense here, but if it's STARTED, it's not a big deal.

@jamiedemaria
Copy link
Contributor Author

closing in favor of #24365

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants