Get the retry_attempt for DirtyExit failures, where perform isn't called. #61

dylanahsmith · 2012-06-24T17:25:35Z

Problem:

A job can be retried indefinitely through Resque::Worker#prune_dead_workers, always using the retry_delay of the first failure. Even worse, if resque pull request #623 gets accepted, this will also happen when a process is killed with kill -9.

Diagnosis:

@retry_attempt depends on the before_perform hooks being called, but those hooks are called after forking, so it is only set in the child job process. If a crash prevents the job from notifying the failure hooks itself, then a DirtyExit exception will be given to the on_failure hooks from a worker process, where @retry_attempt will not have been set.

The @on_failure_retry_hook_already_called variable also needs to be ignored, because this is set to false in the before_perform hook.
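To illustrate the diagnosis, here is a minimal sketch in plain Ruby (no resque required) of why a value set in a class-level hook after forking never reaches the parent. ExampleJob, its hook body, and the value 3 are hypothetical stand-ins, and the abrupt exit only approximates a real crash.

```ruby
# Hypothetical stand-in for a job class whose hooks are class methods.
class ExampleJob
  class << self
    attr_reader :retry_attempt

    # Pretend this hook loaded the attempt count for the job.
    def before_perform_retry(*_args)
      @retry_attempt = 3
    end
  end
end

reader, writer = IO.pipe

pid = fork do
  # Child ("job process"): the hook runs here, after the fork.
  reader.close
  ExampleJob.before_perform_retry
  writer.puts ExampleJob.retry_attempt.inspect
  writer.close
  exit!(1) # abrupt exit that skips at_exit handlers, standing in for a crash
end

# Parent ("worker process"): it never ran the hook itself.
writer.close
child_view = reader.read.strip
Process.wait(pid)
parent_view = ExampleJob.retry_attempt.inspect

puts "child saw @retry_attempt = #{child_view}"
puts "parent saw @retry_attempt = #{parent_view}"
```

The child reports 3 while the parent still sees nil, because memory written after fork is private to the child.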

Solution:

The only situation where the on_failure hooks are called outside of the forked job process is when a DirtyExit exception is being passed to them. Therefore, I used this to detect this code path and set @retry_attempt explicitly. The @on_failure_retry_hook_already_called variable is ignored in this case because it is unreliable without the before_perform hook getting called.
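The approach described above could look roughly like the following sketch. It is not the diff from this pull request: DirtyExit here is a local stand-in for Resque::DirtyExit, and RetryingJob, ATTEMPT_STORE (an in-memory hash standing in for the shared store), and RETRY_DELAYS are hypothetical names chosen so the example runs without resque.

```ruby
# Local stand-in for Resque::DirtyExit so the sketch runs without the gem.
class DirtyExit < RuntimeError; end

class RetryingJob
  ATTEMPT_STORE = Hash.new(-1)      # hypothetical shared store, keyed by job args
  RETRY_DELAYS = [1, 5, 30].freeze  # hypothetical delay (seconds) per attempt

  class << self
    attr_reader :retry_attempt

    # Normally runs in the forked job process.
    def before_perform_retry(*args)
      @on_failure_retry_hook_already_called = false
      @retry_attempt = (ATTEMPT_STORE[args] += 1)
    end

    # Returns the delay that would be used for the retry, or nil if skipped.
    def on_failure_retry(exception, *args)
      if exception.is_a?(DirtyExit)
        # Running in the worker process: before_perform never ran here,
        # so read the attempt from the store and ignore the flag.
        @retry_attempt = ATTEMPT_STORE[args]
      elsif @on_failure_retry_hook_already_called
        return nil
      end
      @on_failure_retry_hook_already_called = true
      RETRY_DELAYS[@retry_attempt] || RETRY_DELAYS.last
    end
  end
end

args = ["order-42"]
# Pretend the killed child had already recorded attempt 1 in the shared store.
RetryingJob::ATTEMPT_STORE[args] = 1

delay = RetryingJob.on_failure_retry(DirtyExit.new("child died"), *args)
puts "retry_attempt=#{RetryingJob.retry_attempt} delay=#{delay}"
```

Without the DirtyExit branch, @retry_attempt would still be unset in the worker at this point, so the delay could not be chosen from the real attempt number.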

This could be a lot cleaner if plugins were called on an instance rather than a class, but that is a poor design decision in resque, which will likely be kept for backwards compatibility.

ticktricktrack · 2012-08-13T16:40:01Z

+1 for this, we have to do something here.

I tested a reboot of our resque-server without shutting the workers down nicely first. Ten jobs ran twice, and a few ran even more often, up to 10 times.

lantins · 2012-09-07T15:50:58Z

I've been trying to write a test case for this issue, but with very little luck.

Could either of you guys help me out with this?

I'm tempted to merge anyhow, it looks sane and the explanation makes sense, but I'd really like to cover it with a test.

dylanahsmith · 2012-10-10T15:51:14Z

I added a test, although it requires resque to notify the failure hooks when Resque::Job#fail is called, which was added in defunkt/resque@b3cdd32 and first released in resque version 1.20.0. As a result, I prevented the test from being defined when using an older version of resque.
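One way to gate a test definition on the library version is sketched below. This is an assumption about the general technique, not the code from the pull request; failure_hooks_on_job_fail?, DirtyExitRetryChecks, and INSTALLED_RESQUE_VERSION are hypothetical names, and the version string is hard-coded so the example runs without resque installed.

```ruby
require "rubygems"

MINIMUM_RESQUE_VERSION = Gem::Version.new("1.20.0")

# Compare as versions, not strings: "1.9.10" sorts after "1.20.0" as a string.
def failure_hooks_on_job_fail?(resque_version)
  Gem::Version.new(resque_version) >= MINIMUM_RESQUE_VERSION
end

class DirtyExitRetryChecks
  INSTALLED_RESQUE_VERSION = "1.19.0" # hypothetical; would come from the gem

  # Only define the test when the installed version supports it.
  if failure_hooks_on_job_fail?(INSTALLED_RESQUE_VERSION)
    def test_retry_attempt_after_dirty_exit
      :would_run_here
    end
  end
end

defined_on_old = DirtyExitRetryChecks.method_defined?(:test_retry_attempt_after_dirty_exit)
puts "test defined on 1.19.0? #{defined_on_old}"
puts "1.20.0 supported? #{failure_hooks_on_job_fail?('1.20.0')}"
puts "1.9.10 supported? #{failure_hooks_on_job_fail?('1.9.10')}"
```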

dylanahsmith · 2012-12-18T16:50:38Z

@lantins any feedback on the test I added to this pull request?

airhorns · 2013-04-03T18:41:18Z

@lantins would be great to get this merged.

thedamfr · 2013-04-06T04:33:55Z

Really! It'd be great. Please, @lantins? Or may we fork you, maybe?

davidguthu · 2013-05-15T13:52:39Z

Any status on this? I noticed one of these in my failure queue and was investigating and found this.

pebrinic · 2013-10-23T20:13:42Z

Any chance this change or a similar change can get merged? This appears to be an issue still

knaidu · 2013-10-28T22:25:12Z

@lantins would be great if you could merge this one.

arthurnn · 2013-10-30T20:47:15Z

any update on this?

lantins · 2013-11-01T12:19:20Z

Guys, just to let you know my OSS projects will get some much needed love and attention this weekend.

lantins · 2013-11-01T12:20:47Z

@dylanahsmith Thanks for providing the test to go with your fix.

dylanahsmith · 2014-04-23T20:14:06Z

@lantins ping

jzaleski · 2014-04-24T01:52:27Z

@dylanahsmith responded w/ one suggestion. Otherwise, this looks good to me.

dylanahsmith · 2014-04-24T02:36:21Z

@jzaleski made the change to use is_a? instead of kind_of?

jzaleski · 2014-04-24T09:57:52Z

Thank you. I plan to give the change one more quick review this morning.
Stay tuned!


lantins · 2014-04-25T12:36:35Z

@jzaleski We need to consider how this relates to #97 - it seems to be kinda the same problem?

dylanahsmith · 2014-04-25T15:13:27Z

The problem has similarities, but the DirtyExit exception needs to be handled differently since it is raised in the worker process, not the child job process, so instance variables may still be set from a previously killed job that was pruned and need to be ignored in this case.
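A small sketch of the stale-state concern, in plain Ruby: in a long-lived process, a class-level flag set while handling one job is still set when the next failure arrives. PruningHooks and its on_failure method are hypothetical and only mimic the "already called" guard.

```ruby
# Hypothetical hook holder; state lives on the class, not on a per-job instance.
class PruningHooks
  class << self
    # Handles a failure once, then skips until something resets the flag.
    def on_failure(_job_id)
      already_called = @hook_already_called
      @hook_already_called = true
      already_called ? :skipped : :handled
    end
  end
end

# Both calls happen in the same long-lived process, with no before_perform
# hook in between to reset the flag.
first = PruningHooks.on_failure("job-1")
second = PruningHooks.on_failure("job-2")

puts "job-1: #{first}"
puts "job-2: #{second}"
```

The second, unrelated job is skipped only because of state left behind by the first, which is why such a flag cannot be trusted on this code path.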

I don't mind helping with the other issue, but I don't want to complicate this pull request by handling this completely different code path.

jzaleski · 2014-04-29T01:08:54Z

@lantins let's finish out #97 and then turn to this. It seems like @davetron5000 is pretty close w/ his logging additions from #99

lantins · 2014-05-19T22:19:25Z

The test case provided passes (with the slight exception of RetryKilledJob.expects(:clean_retry_key).once) with the changes in #100.

orenmazor · 2014-06-25T14:13:31Z

@lantins @jzaleski hey guys, looks like #100 is merged now. can I help out with getting this task wrapped up?

jstorimer · 2014-07-16T02:51:37Z

👍 Would love to see this merged.

jzaleski · 2014-07-18T01:05:51Z

I would really like to try to get to the bottom of this one. It's not clear to me whether this was resolved by another pull-request or is still an issue on the latest version of the resque-retry gem.

@orenmazor @jstorimer are you still seeing this issue on version 1.2.1? If you're not on 1.2.1, what version are you running?

jzaleski · 2014-07-18T01:13:18Z

Also, please rebase/remerge master into this and adjust your changes as necessary.

If you can run w/ a git-install of the gem for a bit you should be able to gather some good information out of the logs for reference (just make sure you set the log-level, for your worker, to debug or higher).
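As a rough illustration of the log-level advice, the sketch below uses Ruby's standard Logger only. How the level is wired into a resque worker depends on the resque version and is not shown; the log messages are made up.

```ruby
require "logger"
require "stringio"

log_output = StringIO.new
logger = Logger.new(log_output)

# At INFO, debug messages are dropped.
logger.level = Logger::INFO
logger.debug("retry_attempt=2 (hidden at INFO)")

# At DEBUG, the same kind of message is written.
logger.level = Logger::DEBUG
logger.debug("retry_attempt=2 (visible at DEBUG)")

puts log_output.string
```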


dylanahsmith · 2014-08-12T22:25:26Z

> please rebase/remerge master into this and adjust your changes as necessary

Done.

> It's not clear to me whether this was resolved by another pull-request or is still an issue on the latest version of the resque-retry gem.

It looks like the issue hasn't been addressed yet, and the regression test from this pull request fails without the code changes.

lantins · 2014-08-12T22:59:12Z

@dylanahsmith you are a hero for seeing this through! Thanks


lantins · 2014-08-13T01:48:32Z

@dylanahsmith New gem published

jzaleski mentioned this pull request Apr 24, 2014

set a TTL on retry keys #98

Merged

lantins added Needs Discussion labels Apr 25, 2014

lantins added this to the v1.2.0 milestone Apr 25, 2014

lantins modified the milestones: v1.3.0, v1.2.0 May 19, 2014

dylanahsmith added 2 commits August 12, 2014 15:17

Add test for retry attempt being accurate for dirty exit failures.

6f65816

lantins removed this from the v1.3.0 milestone Aug 12, 2014

lantins added a commit that referenced this pull request Aug 12, 2014

Merge pull request #61 from dylanahsmith/dirty-exit-retry-attempt

4e46725

Get the retry_attempt for DirtyExit failures, where perform isn't called.

lantins merged commit 4e46725 into lantins:master Aug 12, 2014

dylanahsmith deleted the dirty-exit-retry-attempt branch August 12, 2014 23:35

lantins mentioned this pull request Aug 4, 2015

Unreleased locks when Resque terminates using the loner option lantins/resque-lock-timeout#26

Closed

tjsousa mentioned this pull request Sep 7, 2015

Explicitly remove lock upon job's dirty exit using worker's on failure hook lantins/resque-lock-timeout#27

Merged
