Added condition to check if it is a scheduled save or rerun #43453

Conversation

krzysztof-kubis
Contributor

What?

I added a condition distinguishing a scheduled run from a rerun.

  • a scheduled run always starts a run of all models
  • a rerun, when the flag retry_from_failure=True is set, should only build the models that failed in the previous run.

Why?

The problem was that if there was an error in one model that couldn't be resolved (e.g., due to a data source issue), the flag prevented the other models from being refreshed, even in subsequent scheduled runs.

How?

  • the first run calls {account_id}/jobs/{job_id}/run/
  • subsequent runs call {account_id}/jobs/{job_id}/rerun/
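
For illustration only, here is a minimal sketch of the kind of condition described above, assuming the decision is driven by the Airflow task's try number. The helper name and signature are invented for the example; this is not the exact merged operator code.

# Hypothetical helper, not the merged DbtCloudRunJobOperator code.
def choose_endpoint(account_id: int, job_id: int, try_number: int, retry_from_failure: bool) -> str:
    """Return the dbt Cloud API path to call for this Airflow task try."""
    if retry_from_failure and try_number > 1:
        # Airflow is retrying a failed try -> rerun only the models that failed previously
        return f"{account_id}/jobs/{job_id}/rerun/"
    # First (scheduled) try, or retry_from_failure=False -> full run of all models
    return f"{account_id}/jobs/{job_id}/run/"

# A scheduled run hits run/, while a retry of a failed task hits rerun/.
assert choose_endpoint(1, 42, try_number=1, retry_from_failure=True).endswith("/run/")
assert choose_endpoint(1, 42, try_number=2, retry_from_failure=True).endswith("/rerun/")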



boring-cyborg bot commented Oct 28, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: [email protected]
Slack: https://s.apache.org/airflow-slack

@potiuk
Member

potiuk commented Nov 5, 2024

Can you please add a unit test covering it?

@jaklan

jaklan commented Nov 5, 2024

@potiuk

Can you please add a unit test covering it?

We will, but first we want to know if you (== anyone responsible for dbt Cloud Operator) are fine with the approach

@potiuk
Member

potiuk commented Nov 5, 2024

@potiuk

Can you please add a unit test covering it?

That was part of the first pass of review. At this stage my assessment is that the change lacks a unit test - so it won't generally be accepted anyway.

We will, but first we want to know if you (== anyone responsible for dbt Cloud Operator) are fine with the approach

Maybe. I have no idea, I am not a dbt expert. Possibly someone else who is will take a look. But a lot of people won't even look if there is no unit test; we generally don't accept PRs where the existing unit tests pass both before and after the change, because it means the change is not testable and prone to regression.

Also, unit tests help guide the reviewer in understanding what the change does - unit tests show the scenarios in which the change gets used. Often unit tests explain the reason and circumstances of a change better than words, and as such they are often used - especially when changes are small - as a review guideline. It also shows that the author knows what they are doing, since they are able to reproduce the situation they are talking about in unit tests.

@potiuk
Member

potiuk commented Nov 5, 2024

BTW. There is no-one "responsible" - this is not how open source works. There are > 3000 mostly volunteers who, similarly to you, are both contributing and reviewing the code in their free time away from their day jobs and families, so the best course of action to get your PR merged is to follow reviewers' comments and requests, and ping (in general, not individual people) that your PR needs review.

@jaklan

jaklan commented Nov 5, 2024

@potiuk

That was part of the first pass of review. At this stage my assessment is that the change lacks a unit test - so it won't generally be accepted anyway.

We know it won't be accepted - the point is to get approval to finalize the PR.

Maybe. I have no idea, I am not a dbt expert. Possibly someone else who is will take a look. But a lot of people won't even look if there is no unit test; we generally don't accept PRs where the existing unit tests pass both before and after the change, because it means the change is not testable and prone to regression.

Also, unit tests help guide the reviewer in understanding what the change does - unit tests show the scenarios in which the change gets used. Often unit tests explain the reason and circumstances of a change better than words, and as such they are often used - especially when changes are small - as a review guideline. It also shows that the author knows what they are doing, since they are able to reproduce the situation they are talking about in unit tests.

I don't see any value in writing unit tests before there's confirmation the approach is fine - otherwise it's just a waste of contributors' time. You start with design, not with coding (that's why the confirmation should have happened in the issue anyway; the PR was opened because you asked for it), and the reasoning is described in both the issue and the MR description. That is more than enough to provide feedback on whether it makes sense conceptually for someone familiar with the given provider, especially in a situation where a few potential solutions were proposed.

There is no-one "responsible" - this is not how open source works. There are > 3000 mostly volunteers who, similarly to you, are both contributing and reviewing the code in their free time away from their day jobs and families, so the best course of action to get your PR merged is to follow reviewers' comments and requests, and ping (in general, not individual people) that your PR needs review.

But there has to be someone responsible for the provider, who knows its specifics and who takes the final decision - I guess you don't merge PRs related to providers you don't know just because they seem to be documented and tested; you have to understand the impact.

@potiuk
Member

potiuk commented Nov 5, 2024

I don't see any value in writing unit tests before there's confirmation the approach is fine - otherwise it's just a waste of contributors' time. You start with design, not with coding (that's why the confirmation should have happened in the issue anyway; the PR was opened because you asked for it), and the reasoning is described in both the issue and the MR description. That is more than enough to provide feedback on whether it makes sense conceptually for someone familiar with the given provider, especially in a situation where a few potential solutions were proposed.

Yes. You trade the time of contributors for the time of maintainers. And no, it's not a general "truth" that things start from design. There is an idea of "emerging design" that comes from discussion of implemented code and tests showing how it works. There is even "test driven development", so a statement suggesting that the only way of doing things is to start with design is totally unfounded. And here we prefer to see the code and tests to be able to assess the impact - that saves the precious time of maintainers who (I am sure you can attempt to empathise with them) sometimes review tens of PRs a day and do it in their free, voluntary time. So the best thing you can do is to follow the way we do it here. If we followed whatever each of 3000+ contributors thinks is "better", it would be impossible to manage the flow of incoming contributions. So if you would like to join those 3000 contributors, I think it makes sense that you adapt to the way they work, not the other way round. Or at least that seems more reasonable from a common-sense point of view.

But there has to be someone responsible for the provider, who knows its specifics and who takes the final decision - I guess you don't merge PRs related to providers you don't know just because they seem to be documented and tested; you have to understand the impact.

No. This is how things work in corporate-driven projects. This is not how things work in an open-source project governed by the Apache Software Foundation. There is no-one responsible for this particular piece. You can see the history of contributors by following the git history of the files you modified:

git log --follow providers/src/airflow/providers/dbt/cloud/operators/dbt.py

And see if there are changes relevant to yours, and ping those contributors if you think their opinion is relevant - but a list of all contributors is all that you can see this way. We merge changes if the author (and reviewers, who might or might not be maintainers) convince us that they tested the change and the review looks good. Also, when such a merged change is released, the authors are asked to check the release candidates to confirm that their change worked as expected and that they tested it - see for example #43615. And we can always revert or fix it, and produce an rc2, if you find during your testing that it doesn't work.

So ultimately it's actually your responsibility as an author to test and "convince" maintainers that the change is tested enough, and to test it when a release candidate is sent for voting. This is also why we ask you to add unit tests: to get more confidence that you understand the process and are going to follow it - including testing the release.

We treat contributors very seriously here.

@jaklan

jaklan commented Nov 5, 2024

@potiuk

Yes. You trade the time of contributors for the time of maintainers. And no, it's not a general "truth" that things start from design. There is an idea of "emerging design" that comes from discussion of implemented code and tests showing how it works. There is even "test driven development", so a statement suggesting that the only way of doing things is to start with design is totally unfounded. And here we prefer to see the code and tests to be able to assess the impact - that saves the precious time of maintainers who (I am sure you can attempt to empathise with them) sometimes review tens of PRs a day and do it in their free, voluntary time.

We don't trade the time of contributors for the time of maintainers - it's beneficial for both sides to agree on the direction before going deep into code changes.

Neither "emerging design" nor TDD means you start without the expected goal. Although you want to say you first write tests, then you write logic, and only after that you start defining the outcome in issues / user stories, but that would be, well, very unique approach.

So the best thing you can do is to follow the way we do it here. If we followed whatever each of 3000+ contributors thinks is "better", it would be impossible to manage the flow of incoming contributions. So if you would like to join those 3000 contributors, I think it makes sense that you adapt to the way they work, not the other way round. Or at least that seems more reasonable from a common-sense point of view.

Yes, we will follow the "PR requirements", but that's not the point. In that case we can either do the full implementation and be blocked at the very last stage because of something that was meant to be discussed at the beginning, or implement something very opinionated which gets accepted just because "code and tests look fine". Neither outcome is good.

So ultimately it's actually your responsibility as an author to test and "convince" maintainers that the change is tested enough, and to test it when a release candidate is sent for voting. This is also why we ask you to add unit tests: to get more confidence that you understand the process and are going to follow it - including testing the release.

We are back to the beginning again. The whole discussion is not about "convincing" anyone that the implementation is correct - that part is quite obvious - but about a common agreement on which approach makes sense. We would change the logic, someone else would say they prefer another one and create a new PR, and we would go back and forth indefinitely?

If you say there's no one among the Airflow contributors to make such a decision (which I really don't get - someone has to be a reviewer and someone has to accept the PR anyway, not only, I hope, by reviewing the quality of the code, so such a person could provide feedback already now as well), then we can just reach out to the dbt Labs folks to find the "decision maker" there - with the hope that their agreement would be a sufficient argument for reviewers to accept the agreed solution (once the implementation is finalised, of course).

@potiuk
Member

potiuk commented Nov 6, 2024

Sorry, I have no time to lose on this discussion, but I think if you want to contribute, I suggest you follow the reviewer's requests rather than argue with them. This is not the best way to contribute. And if you are not able to understand and adjust, maybe simple open-source contribution is not for you.

@potiuk
Member

potiuk commented Nov 6, 2024

And yes, if you want to reach out to the dbt maintainers - feel free. As I said, we treat contributors seriously; if you need the dbt maintainers' opinion on that, feel free to reach out to them, bring them here, and have them state their opinion.

@jaklan

jaklan commented Nov 6, 2024

And if you are not able to understand and adjust, maybe simple open-source contribution is not for you

As I said we treat contributors seriously

Of course you do - respecting their time and themselves in the way presented above is the best proof of this 😉 Sorry for wasting your time by raising concerns that you are unable to answer without ad personam remarks. But in fact, it is valuable feedback that we can significantly change the logic of the operator and no one will even wonder about the rationale behind it - as long as there are tests. We will definitely raise that issue during the discussion with the dbt folks so they are aware of what can happen with their provider.

@potiuk
Member

potiuk commented Nov 6, 2024

with dbt folks so they are aware what can happen with their provider.

Please do. Maybe they will contribute back - including the tests. That's all I am asking for - to test providers and provide unit tests for them. If your discussion with dbt brings them here to contribute, I am happy as a bunny.

I also have not noticed any ad personam in anything I wrote, but maybe we have a different definition of ad personam.

@pierrejeambrun pierrejeambrun changed the title Aadded condition to check if it is a scheduled save or rerun Added condition to check if it is a scheduled save or rerun Nov 8, 2024
@pierrejeambrun
Member

pierrejeambrun commented Nov 8, 2024

Hello guys,

@krzysztof-kubis @jaklan do you plan on continuing your work on this PR, just so we know what to do with it? This will most likely help many other users facing the same problem - I think it's worth pursuing.

I noticed the branch is in a weird state, most likely due to a bad rebase I assume.

Best regards,

@krzysztof-kubis
Contributor Author

krzysztof-kubis commented Nov 8, 2024

I noticed the branch is in a weird state, most likely due to a bad rebase I assume.

Solved!

@krzysztof-kubis krzysztof-kubis force-pushed the 43347-problem-with-retry_from_failure-flag-in-DbtCloudRunJobOperator branch from 31d3fda to d988fa2 on November 8, 2024 at 15:21
Member

@potiuk potiuk left a comment


LGTM - the tests now explain what the expectation is. Thanks for adding them! @pierrejeambrun Are you also OK with it?

@jaklan

jaklan commented Nov 9, 2024

Next week we have a call with the dbt folks - we will try to confirm with them whether it's fine to go with this simplified approach, or whether we should implement the more robust one allowing more control over the run mode in dbt Cloud, so something like:

rerun_from_failure = "never" | "always" | "when_task_retried"

(although it's also not perfect, because an Airflow task could be retried for reasons other than a dbt Cloud job failure)
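
Purely as an illustration of that proposal (the names are hypothetical, nothing here is an agreed API), the three modes could be resolved along these lines:

from typing import Literal

RerunMode = Literal["never", "always", "when_task_retried"]

def use_rerun_endpoint(mode: RerunMode, is_task_retry: bool) -> bool:
    # Decide whether the operator should call the rerun/ endpoint under the proposed option.
    if mode == "always":
        return True
    if mode == "when_task_retried":
        return is_task_retry
    return False  # "never"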

@josh-fell
Contributor

Next week we have a call with the dbt folks - we will try to confirm with them whether it's fine to go with this simplified approach, or whether we should implement the more robust one allowing more control over the run mode in dbt Cloud, so something like:

rerun_from_failure = "never" | "always" | "when_task_retried"

(although it's also not perfect, because an Airflow task could be retried for reasons other than a dbt Cloud job failure)

Nothing wrong with incremental changes. The dbt folks were, at least peripherally, involved in the initial creation of this provider. Continued collaboration is definitely welcomed. This provider has come a long way since it was created with contributions from folks like you @krzysztof-kubis, so thank you!

@josh-fell josh-fell merged commit 340a70b into apache:main Nov 9, 2024
56 checks passed

boring-cyborg bot commented Nov 9, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@josh-fell
Contributor

🎉 Congrats @krzysztof-kubis! Keep the contributions coming!

@jaklan

jaklan commented Nov 10, 2024

Nothing wrong with incremental changes. The dbt folks were, at least peripherally, involved in the initial creation of this provider. Continued collaboration is definitely welcomed. This provider has come a long way since it was created with contributions from folks like you @krzysztof-kubis, so thank you!

As you wish, just to note it's already a breaking change - with the second approach we could avoid it by either:

  • introducing a new parameter rerun_from_failure and deprecating retry_from_failure (to differentiate dbt Cloud rerun - as that's the API endpoint - from Airflow task retry)
    or
  • keeping retry_from_failure, but supporting all options like: True | False | "never" | "always" | "when_task_retried" and a) marking True | False as deprecated b) keeping the mapping True -> "always", False -> "never"
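
As a hedged sketch of the second option, the legacy boolean values could be mapped onto the proposed string values while emitting a deprecation warning. The helper name and the warning category are assumptions for illustration, not an agreed design.

import warnings
from typing import Literal, Union

RetryFromFailure = Union[bool, Literal["never", "always", "when_task_retried"]]

def normalize_retry_from_failure(value: RetryFromFailure) -> str:
    # Map legacy booleans onto the proposed string modes so existing DAGs keep working.
    if isinstance(value, bool):
        warnings.warn(
            "Boolean retry_from_failure is deprecated; use 'never', 'always' or 'when_task_retried'.",
            DeprecationWarning,
            stacklevel=2,
        )
        return "always" if value else "never"
    return value

# Backwards-compatible mapping: True -> "always", False -> "never".
assert normalize_retry_from_failure(True) == "always"
assert normalize_retry_from_failure(False) == "never"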

@potiuk
Member

potiuk commented Nov 11, 2024

🎉 Congrats @krzysztof-kubis! Keep the contributions coming!

Indeed

@potiuk
Member

potiuk commented Nov 11, 2024

Despite some grumpy maintainers :). But you know guys, I am Polish, like both of you @jaklan @krzysztof-kubis, so I had to complain.

Sorry if you were a bit too put off initially, and if I was a bit too harsh, but this OSS world is a bit different from the traditional one and the "top-down" thing does not work here - much more responsibility is in the hands of contributors.

@jaklan

jaklan commented Nov 12, 2024

Sorry if you were a bit too put off initially, and if I was a bit too harsh, but this OSS world is a bit different from the traditional one and the "top-down" thing does not work here - much more responsibility is in the hands of contributors.

@potiuk Sure, we get this point, but that's exactly the reason why we had doubts about the best way to move forward. It's just a single-line change, but a breaking change. And the fact that we expect the operator to behave differently doesn't mean it's the same for other users - maybe for most of them the existing behaviour was actually fine?

That's why we wanted to ensure first that there was a common agreement to change the logic, but if there is no direct ownership of the operator - then I believe we should be even more cautious about introducing breaking changes, which could be avoided, for example, in one of the above ways:

  • introducing a new parameter rerun_from_failure and deprecating retry_from_failure (to differentiate dbt Cloud rerun - as that's the API endpoint - from Airflow task retry)
    or
  • keeping retry_from_failure, but supporting all options like: True | False | "never" | "always" | "when_task_retried" and a) marking True | False as deprecated b) keeping the mapping True -> "always", False -> "never"

@potiuk
Member

potiuk commented Nov 12, 2024

@potiuk Sure, we get this point, but that's exactly the reason why we had doubts about the best way to move forward. It's just a single-line change, but a breaking change. And the fact that we expect the operator to behave differently doesn't mean it's the same for other users - maybe for most of them the existing behaviour was actually fine?

That's why we wanted to ensure first that there was a common agreement to change the logic, but if there is no direct ownership of the operator - then I believe we should be even more cautious about introducing breaking changes, which could be avoided, for example, in one of the above ways:

Yeah - and that's where seeing tests helps, and - as you can see - there were people who looked at it and said "yes, looks fine, let's merge it". We have 4000+ operators in Airflow. There is no way we can be authoritative about all of them as maintainers. But you already said you would like to discuss it with the dbt people, so if you have another proposal after talking to them that results in a new PR, that's cool. And if you care about this dbt operator and want to make it better - this is how you become a "steward" of it, by making it better and fixing problems - we as maintainers are super happy if people like you care and want to make things better (and also take more responsibility for making decisions there without having official authority). This is precisely how you might become a committer here - by making decisions, implementing them and taking responsibility for them (including fixing problems when people raise them). And there are a number of places in Airflow that have their own "informal" stewards like that. And those people, when they are active in several parts of the code and engaged, might later be invited to become committers.

This is how it works here - a committer (also known as a maintainer) is not a person who is "responsible" for some part of the code; it is a person who is committed to and engaged in the project. Beyond being able to approve and merge code that "looks good", and whose author is confident about fixing problems and confirms they tested it, they have no "authority" / "responsibility" to make decisions over a specific part of the code alone.

Yes, I know it's surprising that it works, but it does - because authors like you take more responsibility for their own decisions when their code is merged and their name is attached to "I made that change and vouch for it".

The code looks good, we cannot see any big issue, and we trust that you care and made a good decision.

The worst thing that can happen is that people will downgrade the provider and we will have to revert the change if it turns out to have more problems than we foresaw when reviewing it as maintainers - no big harm done - but we trust you will test it with the release candidate as well and confirm it's working.

ellisms pushed a commit to ellisms/airflow that referenced this pull request Nov 13, 2024
…3453)

* Aadded condition to check if it is a scheduled save or rerun

* Fix key name in context of task

* I added unit tests to the condition to check if it is a scheduled save or rerun

---------

Co-authored-by: RBAMOUSER\kubisk1 <[email protected]>
Co-authored-by: krzysztof-kubis <[email protected]>
Co-authored-by: krzysztof-kubis <[email protected]/>
@joellabes

Hey! I'm the DX advocate for the orchestration team at dbt Labs.

I think this approach (only issue a rerun command if it's an Airflow retry) makes sense. It aligns with the behaviour in the run results UI for dbt Cloud, where the only way you can trigger a rerun is to have a job which has partially failed. So it should match up with folks' mental models pretty nicely.

As for the tradeoffs between the possible approaches, I think it would have been useful to flag that it was a breaking change when the PR was opened, as opposed to after the merge had happened. The initial issue enumerated the options but didn't explicitly call out the tradeoffs associated with each one.

(To be clear, I do think you made the right choice in not adding a new option/parameter, since the change aligns with the mental model of the feature in dbt Cloud so should be uncontroversial.)

Thanks for your contribution to the operator!

@jaklan

jaklan commented Nov 13, 2024

@joellabes thanks for the input! The most confusing part here was whether we should follow the "UI mental model" or the "API mental model":

  • UI mental model: e.g. each day we have a new scheduled run, but if some run fails - we can additionally trigger "retry from failure"
  • API mental model: run and rerun are 2 different endpoints, where rerun can also trigger the full run:

    Use this endpoint to retry a failed run for a job from the point of failure, if the run failed. Otherwise trigger a new run.

So one could argue the expected behaviour of the operator option is just to switch which endpoint is used. We were in favor of the first approach, but we didn't want to make that decision by ourselves - now it's confirmed it's aligned with your vision as well 😉

Having said that, I wonder how we can make it more robust. With the current implementation, an Airflow task can fail for whatever reason unrelated to the dbt Cloud job (e.g. a queue timeout) and be retried, which could trigger significantly different behaviour if the previous run (e.g. from the previous day) in dbt Cloud didn't succeed - and that could lead to unexpected results.

A quick idea could be to check whether there's any dbt Cloud run ID associated with the first Airflow task try and, based on its existence / status (+ the retry_from_failure option), decide how exactly the retry should behave - but I don't recall any simple, generic way to keep state between task retries (afaik XComs are cleared). Any ideas are more than welcome.
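
One way to avoid keeping state between retries would be to ask dbt Cloud itself, at retry time, about the latest run of the job. The sketch below is hypothetical: the endpoint path, query parameters, auth header and the numeric status code for "Error" are assumptions made for illustration and should be checked against the dbt Cloud API docs (or the DbtCloudHook) before relying on them.

import requests

def previous_run_failed(base_url: str, token: str, account_id: int, job_id: int) -> bool:
    # Hypothetical check: did the most recent dbt Cloud run of this job end in failure?
    resp = requests.get(
        f"{base_url}/api/v2/accounts/{account_id}/runs/",
        headers={"Authorization": f"Token {token}"},
        params={"job_definition_id": job_id, "order_by": "-id", "limit": 1},
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json().get("data", [])
    # Assumes dbt Cloud's numeric run status where 20 means "Error".
    return bool(runs) and runs[0].get("status") == 20

# A retry could then use rerun/ only when both hold: the Airflow try is a retry
# AND the latest dbt Cloud run of the job actually failed; otherwise start a fresh run.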

As for the tradeoffs between the possible approaches, I think it would have been useful to flag that it was a breaking change when the PR was opened, as opposed to #43453 (comment). The initial issue enumerated the options but didn't explicitly call out the tradeoffs associated with each one.

That's true, we haven't expected things to happen so fast 😄 Especially we thought that comment would simply "freeze" the MR until we have a discussion during the today meeting, but well, it didn't 🙈

@joellabes

which could trigger significantly different behaviour if the previous run (e.g. from the previous day) in dbt Cloud didn't succeed - and that could lead to unexpected results.

It'll only retry if the most recent run of the job failed - if yesterday morning's run failed in dbt Cloud, and yesterday evening's run succeeded, and then this morning's run failed in Airflow, an Airflow retry will trigger a whole new run in Cloud instead of picking up from halfway through yesterday morning's Cloud run. So I think it should be ok?

@jaklan

jaklan commented Nov 14, 2024

@joellabes

if yesterday morning's run failed in dbt Cloud, and yesterday evening's run succeeded, and then this morning's run failed in Airflow, an Airflow retry will trigger a whole new run in Cloud instead of picking up from halfway through yesterday morning's Cloud run. So I think it should be ok?

In this case - yes, but I mean a different scenario:

assume one run per day, retry_from_failure set to True, retries set to e.g. 2

  • yesterday's run failed; there was no other run / rerun
  • today's run was scheduled in Airflow, but it wasn't executed because the task failed due to a queue timeout
  • an Airflow task retry was applied
  • this time the task was executed, but because it was a retry, it wasn't a new run but a retry-from-failure against yesterday's run
