Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get the retry_attempt for DirtyExit failures, where perform isn't called. #61

Merged
merged 2 commits into from
Aug 12, 2014

Conversation

dylanahsmith
Copy link
Contributor

Problem:

A job can get infinitely retried through Resque::Worker#prune_dead_workers, using the first retry_delay of the first failure. Even worse, if resque pull request #623 gets accepted, this will also happen when a process killed with kill -9.

Diagnosis:

@retry_attempt depends on the before_perform hooks being called, but before_perform hooks will be called after forking and set in the child job process. If there is a crash that prevents the job from notifying the failure hooks itself, then a DirtyExit exception will be given to the on_failure hooks from a worker processes, where @retry_attempt will not have been set.

The @on_failure_retry_hook_already_called variable also needs to be ignored, because this is set to false in the before_perform hook.

Solution:

The only situation where the on_failure hooks are called outside for the the forked job process is when a DirtyExit exception is being passed to the on_failure hooks. Therefore, I used this to detect this code path, and explicitly set @retry_attempt explicitly. The @on_failure_retry_hook_already_called variable is ignored in this case because it is unreliable without the before_perform hook getting called.

This could be a lot cleaner if plugins were called on an instance rather than a class, but that is a poor design decision in resque, which will likely be kept for backwards compatiblity.

@ticktricktrack
Copy link

+1 for this, We have to do something here.

I tested a reboot of our resque-server without shutting the workers down nicely first. I got 10 jobs done twice and a few even more often, they got processed up to 10 times.

@lantins
Copy link
Owner

lantins commented Sep 7, 2012

I've been trying to write a test case for this issue, but with very little luck.

Could either of you guys help me out with this?

I'm tempted to merge anyhow, it looks sane and the explanation makes sense, but I'd really like to cover it with a test.

@dylanahsmith
Copy link
Contributor Author

I added a test, although it requires the resque to notify the failure hooks when Resque::Job#fail is called, which was added in defunkt/resque@b3cdd32 and first released in resque version 1.20.0. As a result, I prevented the test from being defined when using an older version of resque.

@dylanahsmith
Copy link
Contributor Author

@lantins any feedback on the test I added to this pull request?

@airhorns
Copy link

airhorns commented Apr 3, 2013

@lantins would be great to get this merged.

@thedamfr
Copy link

thedamfr commented Apr 6, 2013

Really ! It'd great. Plz @lantins ? Or may we fork you maybe ?

@davidguthu
Copy link

Any status on this? I noticed one of these in my failure queue and was investigating and found this.

@pebrinic
Copy link

Any chance this change or a similar change can get merged? This appears to be an issue still

@knaidu
Copy link

knaidu commented Oct 28, 2013

@lantins would be great if you could merge this one.

@arthurnn
Copy link

any update on this?

@lantins
Copy link
Owner

lantins commented Nov 1, 2013

Guys, just to let you know my OSS projects will get some much needed love and attention this weekend.

@lantins
Copy link
Owner

lantins commented Nov 1, 2013

@dylanahsmith Thanks for providing the test to go with your fix.

@dylanahsmith
Copy link
Contributor Author

@lantins ping

@jzaleski
Copy link
Collaborator

@dylanahsmith responded w/ one suggestion. Otherwise, this looks good to me.

@jzaleski jzaleski mentioned this pull request Apr 24, 2014
@dylanahsmith
Copy link
Contributor Author

@jzaleski made the change to use is_a? instead of kind_of?

@jzaleski
Copy link
Collaborator

Thank you. I plan to give the change one more quick review this morning.
Stay tuned!

On Wednesday, April 23, 2014, Dylan Thacker-Smith [email protected]
wrote:

@jzaleski https://github.com/jzaleski made the change to use is_a?instead of
kind_of?


Reply to this email directly or view it on GitHubhttps://github.com//pull/61#issuecomment-41237168
.

@lantins
Copy link
Owner

lantins commented Apr 25, 2014

@jzaleski We need to consider how this relates to #97 - it seems to be kinda the same problem?

@lantins lantins added this to the v1.2.0 milestone Apr 25, 2014
@dylanahsmith
Copy link
Contributor Author

The problem has similarities, but the DirtyExit exception needs to be handled differently since it is running in the worker process, not the child job process, so instance variables may be set from a previous killed job getting pruned and need to be ignored for this case.

I don't mind helping with the other issue, but I don't want to complicate this pull request by handling this completely different code path.

@jzaleski
Copy link
Collaborator

@lantins let's finish out #97 and then turn to this. It seems like @davetron5000 is pretty close w/ his logging additions from #99

@lantins
Copy link
Owner

lantins commented May 19, 2014

The test case provided passes (with slight exception of RetryKilledJob.expects(:clean_retry_key).once) with changes in #100

@lantins lantins modified the milestones: v1.3.0, v1.2.0 May 19, 2014
@orenmazor
Copy link
Contributor

@lantins @jzaleski hey guys, looks like #100 is merged now. can I help out with getting this task wrapped up?

@jstorimer
Copy link
Contributor

👍 Would love to see this merged.

@jzaleski
Copy link
Collaborator

I would really like to try to get to the bottom of this one. It's not clear to me whether this was resolved by another pull-request or is still an issue on the latest version of the resque-retry gem.

@orenmazor @jstorimer are you still seeing this issue on version 1.2.1? If you're not on 1.2.1, what version are you running?

@jzaleski
Copy link
Collaborator

Also, please rebase/remerge master into this and adjust your changes as necessary.

If you can run w/ a git-install of the gem for a bit you should be able to gather some good information out of the logs for reference (just make sure you set the log-level, for your worker, to debug or higher).

…led.

@retry_attempt depends on perform being called, but perform will be called
after forking and set in the child job process. If there is a crash that
prevents the job from notifying the failure hooks itself, then a DirtyExit
exception will be given to the on_failure hooks from a worker processes,
where @retry_attempt will not have been set.

The @on_failure_retry_hook_already_called variable also needs to be
ignored, because this is set to false in the before_perform hook.
@dylanahsmith
Copy link
Contributor Author

please rebase/remerge master into this and adjust your changes as necessary

done

It's not clear to me whether this was resolved by another pull-request or is still an issue on the latest version of the resque-retry gem.

It looks like the issues hasn't been addressed yet, and the regression test from this pull request fails without the code changes.

@lantins lantins removed this from the v1.3.0 milestone Aug 12, 2014
@lantins
Copy link
Owner

lantins commented Aug 12, 2014

@dylanahsmith you are a hero for seeing this through! Thanks

lantins added a commit that referenced this pull request Aug 12, 2014
Get the retry_attempt for DirtyExit failures, where perform isn't called.
@lantins lantins merged commit 4e46725 into lantins:master Aug 12, 2014
@dylanahsmith dylanahsmith deleted the dirty-exit-retry-attempt branch August 12, 2014 23:35
@lantins
Copy link
Owner

lantins commented Aug 13, 2014

@dylanahsmith New gem published

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.