-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get the retry_attempt for DirtyExit failures, where perform isn't called. #61
Get the retry_attempt for DirtyExit failures, where perform isn't called. #61
Conversation
+1 for this, We have to do something here. I tested a reboot of our resque-server without shutting the workers down nicely first. I got 10 jobs done twice and a few even more often, they got processed up to 10 times. |
I've been trying to write a test case for this issue, but with very little luck. Could either of you guys help me out with this? I'm tempted to merge anyhow, it looks sane and the explanation makes sense, but I'd really like to cover it with a test. |
I added a test, although it requires the resque to notify the failure hooks when Resque::Job#fail is called, which was added in defunkt/resque@b3cdd32 and first released in resque version 1.20.0. As a result, I prevented the test from being defined when using an older version of resque. |
@lantins any feedback on the test I added to this pull request? |
@lantins would be great to get this merged. |
Really ! It'd great. Plz @lantins ? Or may we fork you maybe ? |
Any status on this? I noticed one of these in my failure queue and was investigating and found this. |
Any chance this change or a similar change can get merged? This appears to be an issue still |
@lantins would be great if you could merge this one. |
any update on this? |
Guys, just to let you know my OSS projects will get some much needed love and attention this weekend. |
@dylanahsmith Thanks for providing the test to go with your fix. |
@lantins ping |
@dylanahsmith responded w/ one suggestion. Otherwise, this looks good to me. |
@jzaleski made the change to use |
Thank you. I plan to give the change one more quick review this morning. On Wednesday, April 23, 2014, Dylan Thacker-Smith [email protected]
|
The problem has similarities, but the DirtyExit exception needs to be handled differently since it is running in the worker process, not the child job process, so instance variables may be set from a previous killed job getting pruned and need to be ignored for this case. I don't mind helping with the other issue, but I don't want to complicate this pull request by handling this completely different code path. |
@lantins let's finish out #97 and then turn to this. It seems like @davetron5000 is pretty close w/ his logging additions from #99 |
The test case provided passes (with slight exception of |
👍 Would love to see this merged. |
I would really like to try to get to the bottom of this one. It's not clear to me whether this was resolved by another pull-request or is still an issue on the latest version of the @orenmazor @jstorimer are you still seeing this issue on version 1.2.1? If you're not on 1.2.1, what version are you running? |
Also, please rebase/remerge master into this and adjust your changes as necessary. If you can run w/ a git-install of the gem for a bit you should be able to gather some good information out of the logs for reference (just make sure you set the log-level, for your worker, to debug or higher). |
…led. @retry_attempt depends on perform being called, but perform will be called after forking and set in the child job process. If there is a crash that prevents the job from notifying the failure hooks itself, then a DirtyExit exception will be given to the on_failure hooks from a worker processes, where @retry_attempt will not have been set. The @on_failure_retry_hook_already_called variable also needs to be ignored, because this is set to false in the before_perform hook.
done
It looks like the issues hasn't been addressed yet, and the regression test from this pull request fails without the code changes. |
@dylanahsmith you are a hero for seeing this through! Thanks |
Get the retry_attempt for DirtyExit failures, where perform isn't called.
@dylanahsmith New gem published |
Problem:
A job can get infinitely retried through Resque::Worker#prune_dead_workers, using the first retry_delay of the first failure. Even worse, if resque pull request #623 gets accepted, this will also happen when a process killed with
kill -9
.Diagnosis:
@retry_attempt
depends on the before_perform hooks being called, but before_perform hooks will be called after forking and set in the child job process. If there is a crash that prevents the job from notifying the failure hooks itself, then a DirtyExit exception will be given to the on_failure hooks from a worker processes, where@retry_attempt
will not have been set.The
@on_failure_retry_hook_already_called
variable also needs to be ignored, because this is set to false in the before_perform hook.Solution:
The only situation where the on_failure hooks are called outside for the the forked job process is when a DirtyExit exception is being passed to the on_failure hooks. Therefore, I used this to detect this code path, and explicitly set
@retry_attempt
explicitly. The@on_failure_retry_hook_already_called
variable is ignored in this case because it is unreliable without the before_perform hook getting called.This could be a lot cleaner if plugins were called on an instance rather than a class, but that is a poor design decision in
resque
, which will likely be kept for backwards compatiblity.