How to diagnose jobs seemingly not retrying #97
Are you using a failure backend? Logging can be supported by use of a failure backend (part of Resque). We can bat this around on GitHub issues, or if you use IRC/some other IM I'd be happy to chat/help where possible. Luke

You can find me on Freenode and Quakenet under the user …
Versions:
Ruby 2.1, and our config looks like so:

```ruby
require 'resque/failure/redis'

Resque::Failure::MultipleWithRetrySuppression.classes = [Resque::Failure::Redis]
Resque::Failure.backend = Resque::Failure::MultipleWithRetrySuppression
Resque::Plugins::Status::Hash.expire_in = 24.hours.to_i
Resque.logger.level = Logger::INFO
```

I won't have good access to a chat-like thing today, but maybe Monday. Thanks for anything you can tell me!
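As a sanity check (nothing app-specific here, just reading back the same Resque accessors the config above sets), the wiring can be confirmed from a console:

```ruby
require 'resque'
require 'resque-retry'
require 'resque/failure/redis'

# Read back the failure-backend configuration set in the initializer above.
Resque::Failure.backend
# => Resque::Failure::MultipleWithRetrySuppression
Resque::Failure::MultipleWithRetrySuppression.classes
# => [Resque::Failure::Redis]
```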
@lantins I'm around as well if you need help with anything.
@jzaleski Hey :) Any assistance from your good self is very welcome! @davetron5000 A couple of things: …
Most jobs don't use status, and we do see these same failures on apps without that plugin. Pretty much all our jobs use resque-retry. resque-web is pretty much useless for us because we have thousands of scheduled jobs at any given time (over 50 pages), and it's hard to predict when a job will fail, much less find it in the web UI. 99.9% never fail, and of those that do, almost all are because of SIGTERM and can be retried (which is why I've been bumping the number of retries up so high). That's why I like logging, because I can search what happened after the fact. Given console access, is there an easy way to access jobs in the schedule that are there for retry reasons? I can manually check at different times or even run a script to check and report back. Alternatively, is there a way to get access to the number of times a job has retried, e.g. in the job code perhaps? I can use that to add logging as well.
Re: console poking about

All data is stored in Redis; if you wish to poke about, look at the following to figure out what keys to look at: …

Re: access to retry count in code

All resque-retry internals can be accessed by your job code; you can access the number of times a job has retried via the plugin's retry attempt counter. Depending on the number of jobs, I'd be tempted to simply have a job log some timestamp/retry information to a file that you can later grep/paw over. If you have a larger number of jobs, perhaps only start doing this after the 15th retry?

Other

If you've not already read it, this blog post may prove useful: http://hone.heroku.com/resque/2012/08/21/resque-signals.html
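A minimal sketch of that file-logging idea (the job class, file path, retry numbers and the 15-attempt threshold are made up for illustration; resque-retry's before_perform_retry hook sets a class-level @retry_attempt counter that perform can read):

```ruby
require 'resque'
require 'resque-retry'

# Hypothetical job, for illustration only.
class MessageJob
  extend Resque::Plugins::Retry
  @queue       = :messages
  @retry_limit = 25
  @retry_delay = 60

  def self.perform(*args)
    # @retry_attempt is set on the class by resque-retry's before_perform_retry
    # hook before perform runs; nil (=> 0) if the hook never ran.
    attempt = @retry_attempt.to_i
    if attempt > 15
      File.open('/tmp/resque_retry_diagnostics.log', 'a') do |f|
        f.puts "#{Time.now.utc} #{name} attempt=#{attempt} args=#{args.inspect}"
      end
    end
    # ... the job's real work goes here ...
  end
end
```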
Thanks, good info. I will poke around and see what I can see. I tried some of this out locally and it does seem to be working, so I can at least add some logging. My fear is that the jobs are failing outside the Resque hook/perform cycle, meaning resque-retry isn't getting a hold of these failures.
I have a gut feeling the blog post will help if you were not aware of the signal handling change; there are a number of Resque issues related to Heroku/SIGTERM, and this change addresses those issues. Please let us know what you find.
Thanks @lantins, that is exactly the resource I was going to point him at. @davetron5000, two quick questions: when you start a worker, what options …
Our invocations generally look like this: …

We've been setting …
@jzaleski Hoping you have some more ideas! :)

I'm gonna run our code today with … Thanks again for the help. Will report back in a few hours…
Sounds good @davetron5000, the more information the better. I was planning …
OK, so it happened again this morning (finally :). Here's the log excerpt:
Observations:
It seems like it's either dying in the after_fork hook, or it got an exception before calling perform. Does this point us somewhere more specific?
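If it helps narrow that down, one cheap diagnostic (a sketch, assuming it lives in an initializer alongside the Resque config) is to log from the after_fork hook itself, so the logs show whether the forked child even reaches it:

```ruby
require 'resque'

# Log from the after_fork hook; if this line never shows up for a failed job,
# the child died before (or during) after_fork rather than inside perform.
Resque.after_fork do |job|
  Resque.logger.info "after_fork pid=#{Process.pid} class=#{job.payload['class']} args=#{job.payload['args'].inspect}"
end
```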
OK, more info. Here's the complete log message from a different failed job, noise stripped:
All that aside, I'm guessing that since Resque workers are short-lived, this is causing a problem, which I will now explain in a timeline so that someone else can agree with me:
Seems like launching threads from a Resque job that are intended to outlive the forked worker process is likely a bad idea? I will attempt to reconfigure this stuff without spawning threads and see if that changes things.
Thanks for the information, it seems you're figuring it out quite fine by yourself :) Yes, the threads associated with the worker cannot outlive the parent process. If the processing (sending the message) happens in another thread/process from the worker, it seems sensible that the worker won't know anything about any failures, so the plugin would be useless. Unless the process fails before your thread is spawned, it will never trigger a retry. I'd suggest sending the message within the same process/thread as the main worker. If your timeline is correct, the …
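To make the suggestion concrete, a rough sketch (SendMessageJob and deliver are stand-ins, not real code from this app): keep the work, or at least a join, inside perform so a failure surfaces where resque-retry's hooks can see it.

```ruby
require 'resque'
require 'resque-retry'

class SendMessageJob
  extend Resque::Plugins::Retry
  @queue = :messages

  def self.perform(payload)
    # Option 1: do the delivery inline, in the worker's own process/thread.
    deliver(payload)

    # Option 2: if a separate thread is unavoidable, join it before returning.
    # Thread#join re-raises any exception from the thread here, so the failure
    # still reaches the worker's rescue and the retry/failure hooks.
    #
    #   t = Thread.new { deliver(payload) }
    #   t.join
  end

  def self.deliver(payload)
    # placeholder for the real message delivery
  end
end
```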
OK, more info now. sucker_punch, celluloid, and new threads removed. What I'm seeing now is the job getting an exception in what I think is the child block of Resque::Worker#work:

```ruby
# Resque::Worker#work excerpt
else # this block is the child process
  unregister_signal_handlers if will_fork? && term_child
  begin
    reconnect
    perform(job, &block) # <=== logs indicate this is not being called
  rescue Exception => exception
    report_failed_job(job, exception)
  end
  # ...
```

Here's what I see in my logs. First, we see the job get picked up:
The worker is … Next, we see that same worker and PID run the …

The first line of my …
The timestamp of this failure is the same as what we see in resque-web. I believe this is raised from one of the signal handlers, due to the string "SIGTERM" being the only message:

```ruby
# Resque::Worker
def unregister_signal_handlers
  trap('TERM') do
    trap('TERM') do
      # ignore subsequent terms
    end
    raise TermException.new("SIGTERM") # <===========
  end
  trap('INT', 'DEFAULT')
  begin
    trap('QUIT', 'DEFAULT')
    trap('USR1', 'DEFAULT')
    trap('USR2', 'DEFAULT')
  rescue ArgumentError
  end
end
```

Around 25 seconds later, we see this (note that the job is configured to retry after 60 seconds): the job is picked up, this time by a different worker, …
Job is now fully executed:
Here is where the job's …
I have verified in our logs that the code that queues this job occurred only once, i.e. there were not two jobs with the same args queued. So, there are two bits of odd behavior:
Job being retried sooner

My only explanation is that somehow, resque-retry is getting a retry delay of zero, which would re-enqueue the job immediately:

```ruby
def try_again(exception, *args)
  # pass the exception class if retry_delay accepts an argument
  temp_retry_delay = ([-1, 1].include?(method(:retry_delay).arity) ? retry_delay(exception.class) : retry_delay)

  retry_in_queue = retry_job_delegate ? retry_job_delegate : self
  if temp_retry_delay <= 0 # <==============
    # a zero (or negative) delay puts the job straight back on the queue
    Resque.enqueue(retry_in_queue, *args_for_retry(*args))
  else
    Resque.enqueue_in(temp_retry_delay, retry_in_queue, *args_for_retry(*args))
  end
  # ...
end
```

In console the class returns a method with arity -1, so I don't see how this could happen, but since there's no logging about what's going on, it's hard to be sure. It certainly seems possible that this code could run before …

Job being put into the failed queue AND retried

The only way I can explain this would be this code in the failure backend:

```ruby
def save
  if !(retryable? && retrying?)
    cleanup_retry_failure_log!
    super
  elsif retry_delay > 0
    # ...
```

If, for some reason, retrying? returned false here, the failure would be passed up to the superclass and, in my configuration, land on the failed queue. And this could happen if the job's retry information had not yet been written to Redis. Since there's no logging there's no way to know for sure, but this long-winded explanation fits the facts.

Fixing it

What seemed odd about my journey through the code was that we schedule the job for retry in one place (try_again), but record that it is retrying in another (before_perform_retry). Is there a reason for this? Or should we move this code from … ?

```ruby
retry_key = redis_retry_key(*args)
Resque.redis.setnx(retry_key, -1) # default to -1 if not set
```
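In the meantime, here's the kind of console check I can run while one of these jobs is sitting in the delayed queue (MyJob and the args are placeholders; redis_retry_key and the Redis calls are the same ones the plugin uses above):

```ruby
# Does resque-retry currently have a retry key recorded for this job + args?
key = MyJob.redis_retry_key('the', 'job', 'args')
Resque.redis.exists(key)  # false here would explain the failure not being suppressed
Resque.redis.get(key)     # the attempt counter, when present
```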
Awesome stuff @davetron5000! It seems like you may be on the right track. I will give your "Fixing it" suggestion some consideration. In the meantime, if you'd like, add some logging to the gem and submit a pull-request (I am sure the community would appreciate it -- I will happily merge it).
PR submitted. I'm going to try to run my branch of this in production and see what it tells us.
@davetron5000 Have you tried your suggested fix? If so, anything to report? If all the tests pass with the suggested fix, I see no reason why we can't merge, but I would like to look at this a little closer (perhaps once we've got some data from your extra logging?).
Deployed #99 to production yesterday. Unfortunately I didn't have logging inside the failure backend, so I'm still getting an incomplete picture. Also, the log messages are out of order, but I'm going to blame Papertrail for that; I've rearranged them in the order they were logged. Job gets picked up:

We see the new logging from resque-retry:

This is where the story ends, unfortunately. It at least confirms that the logic inside the plugin itself is acting as expected, which blows my theory that the retry delay was becoming 0 through some race condition. I'm going to deploy my updated fork that has logging in the failure backend and we can see where this left off. Also, it seems that resque-scheduler does some logging which I will enable; it could be the culprit here, too. As before, the job is run again a few seconds later:
Okay, keep us updated, as I'm sure you will!
tl;dr

The plugin assumes that every job's perform is reached; if the worker's child is killed before that, the retry bookkeeping never happens. I don't know what the solution is, but having …

The details

OK, I believe my theory is confirmed. Here's the log from the plugin:

And here's the log from the failure back-end:
The second line is from this code:

```ruby
def save
  log 'failure backend save', payload, exception
  if !(retryable? && retrying?)
    log "!(#{retryable?} && #{retryable? && retrying?}) - sending failure to superclass", payload, exception
    cleanup_retry_failure_log!
    super
```

This means that retrying? must have returned false:

```ruby
def retrying?
  Resque.redis.exists(retry_key)
end
```

And retry_key is:

```ruby
def retry_key
  klass.redis_retry_key(*payload['args'])
end
```
So, what this tells me is that when the failure back-end is figuring out what to do, the plugin has not yet placed the job's retry information into Redis, so the failure backend figures the job isn't being retried and defers up the chain (which, in my configuration, places the job on the failed queue). Since the exception is raised before perform is ever called in the worker's child block:

```ruby
begin
  reconnect
  perform(job, &block)
rescue Exception => exception
  report_failed_job(job, exception)
end
```
end In that case, def before_perform_retry(*args)
log 'before_perform_retry', args
@on_failure_retry_hook_already_called = false
# store number of retry attempts.
retry_key = redis_retry_key(*args)
Resque.redis.setnx(retry_key, -1) # <=======================
@retry_attempt = Resque.redis.incr(retry_key)
log "attempt: #{@retry_attempt} set in Redis", args
@retry_attempt
end So, back to def fail(exception)
run_failure_hooks(exception)
Failure.create \
:payload => payload,
:exception => exception,
:worker => worker,
:queue => queue
end
So, it seems that the intended happy path for the plugin would be as follows:

1. The forked child gets as far as perform; the before_perform_retry hook runs and records the retry key in Redis.
2. perform raises; the on_failure hook schedules the retry.
3. The failure backend's save sees the retry key (retrying? is true) and suppresses the failure.

What's happening in my case is this:

1. The child is killed by SIGTERM before perform is ever called, so before_perform_retry never runs and no retry key is recorded.
2. The TermException is rescued and the job is failed.
3. The failure hooks still schedule a retry (we see the job run again), but the failure backend finds no retry key (retrying? is false), so the job also lands on the failed queue.

Whew! All that to say that the retry logic assumes that we get as far as perform, or it doesn't work. Solving this is not as clear to me. We can't move the setting of …
@davetron5000 are you working on a fix for this? Should I expect a pull-request? Just trying to ensure that this does not fall off the grid. |
I am going to submit a pull request for this, just have been sidetracked. I think you are good to wait until you hear from me |
Alrighty. Thanks for the update.
My proposed fix is in #100 |
We are on Heroku and, due to daily (or more) dyno restarts, a lot of our jobs fail with SIGTERM. Almost all of these jobs can be retried, and we've set up resque-retry. We still noticed failures, so for a particular job, we set the job up like so:
We still see consistent failures from this job. I just can't believe that it failed twenty times in a row over 20 minutes, each time with a SIGTERM.
This leads me to suspect that the job is not being retried at all.
How can I confirm this? Or, how can I examine what Resque retry is doing when looking at logs? I don't see any logging info in the source code—is there some I'm missing and should expect to be there?
Sorry for using issues for support—please let me know if there's a better place to ask this.