Regression: Retry on Failure causes infinite Retries #216
Comments
Can you share the content of the log file? I'm confused about where the "out of 4" comes from. Is default_args modified anywhere else?
You'll notice a looping pattern in the log snippet below.
Still an issue?
Hi! We disabled retries until we get a fix for this. Was a fix checked in?
Should be fixed. We use retries everywhere at Airbnb. Reopen if it's still an issue.
Yes, I just turned retries back on and it is working in 1.6.*
I am seeing the infinite retries once again! This needs to be reopened. I'm on 1.6.1 with a retry limit of 1, but tasks keep retrying and the attempt count never increases toward the max.
We've seen this bug; the path to it is that when a retry gets queued, the attempt count doesn't get incremented.
I was looking at this today and confirmed the issue only appears when we have a pool defined. Thanks for the fix!
I am seeing infinite retries with pools using the current bleeding-edge repo (0f28090). You can replicate this issue with the following DAG, which assumes the existence of a pool called "TrivialTasks":
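A minimal sketch of such a DAG (illustrative, not the commenter's original code; the DAG and task names are made up, the Airflow 1.x-era imports are assumed, and the "TrivialTasks" pool is assumed to already exist), with a task that always fails so the retry path through the pool is exercised:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def always_fail():
    # Fail on purpose so the scheduler has to queue a retry through the pool.
    raise RuntimeError("intentional failure to exercise the retry path")


default_args = {
    'owner': 'airflow',
    'start_date': datetime(2016, 1, 1),
    'retries': 1,                        # should stop after a single retry
    'retry_delay': timedelta(minutes=1),
}

dag = DAG('pool_retry_repro', default_args=default_args, schedule_interval='@once')

PythonOperator(
    task_id='always_fail',
    python_callable=always_fail,
    pool='TrivialTasks',                 # pool assumed to exist already
    dag=dag,
)
```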
This may be related to #1225, except that there's an infinite retry even without using …
`SchedulerJob` contains a `set` of `TaskInstance`s called `queued_tis`. `SchedulerJob.process_events` loops through `queued_tis` and tries to remove completed tasks. However, without customizing `__eq__` and `__hash__`, the following two lines have no effect, so elements are never removed from `queued_tis`, leading to infinite retries on failure. This is related to my comment on apache#216. The following code was introduced in the fix to apache#1225.

```
elif ti in self.queued_tis:
    self.queued_tis.remove(ti)
```
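A standalone sketch of that failure mode, using a simplified stand-in class rather than Airflow's real `TaskInstance`: without value-based `__eq__` and `__hash__`, set membership falls back to object identity, so a freshly constructed instance describing the same task never matches the one already in `queued_tis`, and the `remove` branch is never taken.

```python
# Simplified stand-in for TaskInstance (not Airflow's actual class).
class TaskInstance:
    def __init__(self, dag_id, task_id, execution_date):
        self.dag_id = dag_id
        self.task_id = task_id
        self.execution_date = execution_date


queued_tis = {TaskInstance("my_dag", "my_task", "2016-01-01")}
ti = TaskInstance("my_dag", "my_task", "2016-01-01")  # same logical task, new object
print(ti in queued_tis)  # False: default hashing/equality is identity-based


# With value-based equality and hashing, the lookup (and hence removal) works.
class HashableTaskInstance(TaskInstance):
    def _key(self):
        return (self.dag_id, self.task_id, self.execution_date)

    def __eq__(self, other):
        return isinstance(other, HashableTaskInstance) and self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


queued_tis = {HashableTaskInstance("my_dag", "my_task", "2016-01-01")}
ti = HashableTaskInstance("my_dag", "my_task", "2016-01-01")
print(ti in queued_tis)  # True: the completed task can now be found
queued_tis.remove(ti)    # succeeds, so the scheduler would stop re-queuing it
```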
I pulled the latest code from git/master and set it up. I have observed a regression now on 3 separate occasions.
Here are the default args that I am using and passing to my code.
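Something along these lines; the concrete values below are illustrative (retries set so a task gets 4 total attempts, matching the "out of 4" mentioned next), not necessarily the original settings:

```python
from datetime import datetime, timedelta

# Illustrative default_args only; the specific values are assumptions, not the
# exact settings from this report. retries=3 gives 4 total attempts ("out of 4").
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 1, 1),
    'email_on_failure': True,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}
```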
I keep getting this exception every few minutes... The retry counter never advances to 2 out of 4. This worked fine until a few days ago, when I pulled the latest code.
The task failing is