
How to diagnose jobs seemingly not retrying #97

Closed
davetron5000 opened this issue Apr 19, 2014 · 28 comments

@davetron5000
Contributor

We are on Heroku and, due to daily (or more) dyno restarts, a lot of our jobs fail with SIGTERM. Almost all of these jobs can be retried, and we've set up resque-retry.

We still noticed failures, so for a particular job, we set the job up like so:

class FireItemEventJob
  extend Resque::Plugins::Retry

  @queue = :fire_item_event
  @retry_limit = 20
  @retry_delay = 60 #seconds

  def self.perform(item_id,event_name,current_user_id=nil,new_hizzy_id=nil)
    # ...
  end
end

We still see consistent failures from this job. I just can't believe that it failed twenty times in a row over 20 minutes, each time with a SIGTERM.

This leads me to suspect that the job is not being retried at all.

How can I confirm this? Or, how can I examine what Resque retry is doing when looking at logs? I don't see any logging info in the source code—is there some I'm missing and should expect to be there?

Sorry for using issues for support—please let me know if there's a better place to ask this.

@lantins
Owner

lantins commented Apr 19, 2014

Hi @davetron5000

Are you using a failure backend?
What version of Ruby, Resque and resque-retry are you using?

Logging can be supported by use of a failure backend (part of Resque).
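
For example, a rough (untested) sketch of a logging-only failure backend; it could be chained into MultipleWithRetrySuppression.classes alongside Resque::Failure::Redis, and note it will only be called for failures that the retry suppression does not swallow:

require 'resque'
require 'resque/failure/base'

# Minimal sketch: log every failure passed through to the regular backends.
class LoggingFailureBackend < Resque::Failure::Base
  def save
    Resque.logger.error("job failed: queue=#{queue} class=#{payload['class']} args=#{payload['args'].inspect} error=#{exception.class}: #{exception.message}")
  end
end

# e.g. Resque::Failure::MultipleWithRetrySuppression.classes = [Resque::Failure::Redis, LoggingFailureBackend]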

We can bat this around on GitHub issues, or if you use IRC or some other IM I'd be happy to chat/help where possible.

Luke

@lantins
Owner

lantins commented Apr 19, 2014

You can find me on Freenode and Quakenet under the user lantins.

@davetron5000
Contributor Author

Versions:

resque (1.25.2)
resque-retry (1.1.4)
resque-scheduler (2.5.5)
resque-status (0.4.2)
resque_mailer (2.2.6)

Ruby 2.1, and our config looks like so:

require 'resque/failure/redis'
Resque::Failure::MultipleWithRetrySuppression.classes = [Resque::Failure::Redis]
Resque::Failure.backend = Resque::Failure::MultipleWithRetrySuppression

Resque::Plugins::Status::Hash.expire_in = 24.hours.to_i

Resque.logger.level = Logger::INFO

I won't have good access to a chat-like thing today, but maybe Monday. Thanks for anything you can tell me!

@jzaleski
Collaborator

@lantins I'm around as well if you need help with anything.


@lantins
Owner

lantins commented Apr 19, 2014

@jzaleski Hey :) Any assistance from your good self is very welcome!

@davetron5000 Couple of things:

  1. Retries should be working, but that doesn't mean they are!
  2. Your code snippets look fine to me.
  3. I've never used the Status plugin alongside Retry; this may be worth a closer look, or taking it out of the loop to see what happens.
  4. Okay, you are using our failure backend; log in to the web interface and watch the failing jobs. Do they fail after 20 minutes, or do they instantly use up the 20 retry attempts?
  5. If possible; run the unit tests on your Heroku stack and see if anything is flagged up.
  6. Are you using resque-retry with any other jobs? Do they have similar problems?

@davetron5000
Contributor Author

Most jobs don't use status, and we do see these same failures on apps w/out that plugin. Pretty much all our jobs use resque-retry.

resque-web is pretty much useless for us because we have thousands of scheduled jobs at any given time (over 50 pages), and it's hard to predict when a job will fail, much less find it in the web UI. 99.9% never fail, and of those that do, almost all are because of SIGTERM and can be retried (which is why I've been bumping the number of retries up so high). That's why I like logging, because I can search what happened after the fact.

Given console access, is there an easy way to access jobs in the schedule that are there for retry reasons? I can manually check at different times or even run a script to check and report back.

Alternatively, is there a way to get access to the number of times a job has retried? e.g. in the job code perhaps? I can use that to add logging as well.

@lantins
Owner

lantins commented Apr 19, 2014

Re: console poking about

All data is stored in Redis; if you wish to poke about, take a look at the following to figure out which keys to inspect:

redis_retry_key
failure_key
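
For example, a minimal console sketch (the arguments here are just placeholders; redis_retry_key is the method the plugin adds to your job class, and the delayed:* keys are resque-scheduler's buckets of delayed jobs):

require 'resque'
require 'resque-retry'

args = [1234, "some event"]                     # hypothetical job arguments
key  = FireItemEventJob.redis_retry_key(*args)  # per-job retry key
Resque.redis.get(key)           # attempt counter; nil if no retry is being tracked
Resque.redis.keys('delayed:*')  # resque-scheduler buckets holding delayed (retrying) jobs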

Re: access to retry count in code

All resque-retry internals can be accessed by your job code; you can access the number of times a job has retried with the retry_attempt method.

Depending on the number of jobs, I'd be tempted to simply have the job log some timestamp/retry information to a file that you can later grep/paw over. If you have a larger number of jobs, perhaps only start doing this after the 15th retry?
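
Something along these lines (a sketch only, using the retry_attempt and retry_limit helpers the plugin gives you):

class FireItemEventJob
  extend Resque::Plugins::Retry

  @queue       = :fire_item_event
  @retry_limit = 20
  @retry_delay = 60

  def self.perform(item_id, event_name, current_user_id = nil, new_hizzy_id = nil)
    # retry_attempt is 0 on the first run and increments on each retry
    if retry_attempt.to_i >= 15
      Resque.logger.info("#{name}(#{item_id}): attempt #{retry_attempt} of #{retry_limit} at #{Time.now.utc}")
    end
    # ... the actual work ...
  end
end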

Other

If you've not already read it, this blog post may prove useful: http://hone.heroku.com/resque/2012/08/21/resque-signals.html

@davetron5000
Contributor Author

Thanks, good info. I will poke around and see what I can see. I tried some of this out locally and it does seem to be working, so I can at least add some logging. My fear is that the jobs are failing outside the Resque hook/perform cycle, meaning resque-retry isn't getting a hold of these failures.

@lantins
Owner

lantins commented Apr 19, 2014

I have a gut feeling the blog post will help if you were not aware of the signal handling change; there are a number of Resque issues related to Heroku/SIGTERM, and this change addresses those issues.

Please let us know what you find.

@jzaleski
Collaborator

Thanks @lantins, that is exactly the resource I was going to point him at.

@davetron5000, two quick questions: when you start a worker, what options are you specifying in the environment? Also, how long do your jobs take to finish on average?


@davetron5000
Contributor Author

Our invocations generally look like this:

env NEW_RELIC_DISPATCHER=resque TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=* --trace

We've been setting TERM_CHILD for the reasons outlined in that article you linked. I haven't resorted to turning on VVERBOSE yet, but I could certainly try it for a few days at least…

@lantins
Owner

lantins commented Apr 20, 2014

  • The retry logic should definitely be firing (but again, that doesn't mean it is! :))
  • I'd be very curious for you to investigate point (4) that I mentioned above, even without the web interface: are the retries being used up all at once, or are they used up over 20 minutes?
  • I'm still game for having a chat if that's still possible on Monday; perhaps we could take a look together?
  • It would be nice if we could reproduce this in a standalone app running on a free dyno (perhaps pulling in all the same dependencies your app is). Can you supply the Gemfile for your app?

@jzaleski Hoping you have some more ideas! :)

@davetron5000
Contributor Author

I'm gonna run our code today with VVERBOSE, and I also added some logging about the number of retries each time the job runs. This particular job fires a lot during the (East Coast of the US) afternoon, so we're bound to see a failure. With that I can figure out how many times it retries (and when) or if the job simply died out-of-band.

Thanks again for the help. Will report back in a few hours…

@jzaleski
Collaborator

Sounds good @davetron5000, the more information the better. I was planning to do some work on this today so please let us know your findings.


@davetron5000
Contributor Author

OK, so it happened again this morning (finally :). Here's the log excerpt:

Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 2: got: (Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil]) 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 1005: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil])] 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 1005: (Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil]) failed: #<Resque::TermException: SIGTERM> 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil])] 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 2: Exiting... 

Observations:

  • "done" message doesn't happen
  • the first line of my job is a log statement—that isn't showing up, so it seems that perform isn't being called
  • after_fork is being called before before_fork?
  • There is no stack trace in the failed record in the failed queue, just SIGTERM and Resque::TermException.
  • nothing is in redis under the job's redis_retry_key
  • I have no after_fork hooks configured, nor does resque-retry or resque-status.

It seems like it's either dying in the after_fork hook, or it got an exception before calling perform and is dying in the failure hooks.

Does this point us somewhere more specific?

@davetron5000
Contributor Author

OK, more info. Here's the complete log message from a different failed job, noise stripped:

10:25:56 :  ** [17:25:55 2014-04-22] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil])] 
10:25:56 :  ** [17:25:55 2014-04-22] 2: got: (Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil]) 
10:25:56 :  ** [17:25:56 2014-04-22] 1693: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil])] 
10:25:56 :  remote_ip[n/a] uuid[n/a] pid[1693] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] Class(1101343,PI scan): attempt 0 of 20 
10:25:56 :  BACKGROUND AFTER Transmitting Message on sf.merch.style.sku.item.1101343.updated -> {:item_id=>1101343} 
10:25:56 :  BACKGROUND START Transmitting Message on sf.merch.style.sku.item.1101343.updated -> {:item_id=>1101343} 
10:25:57 :  ** [17:25:57 2014-04-22] 2: Exiting... 
10:25:57 :  BACKGROUND END Transmitting Message on sf.merch.style.sku.item.1101343.updated -> {:item_id=>1101343} 
10:25:58 :  ** [17:25:57 2014-04-22] 1693: (Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil]) failed: #<Resque::TermException: SIGTERM> 

All that BACKGROUND stuff is from the job, and it's coming from code running in another thread spawned by sucker punch. Basically, it's sending a message on a message bus, and using sucker punch so that it doesn't tie up the main thread. This makes sense if the code is run in a Rails request cycle with a persistent worker such as Unicorn.

I'm guessing that since Resque workers are short-lived, this is causing a problem, which I will now explain in a timeline, so that someone else can agree with me:

Ticks | Master Resque Worker | Child Worker  | Sucker Punch Thread
0     | Waiting for work     |               |
1     | Sees work, forks     | alive!        |
2     | Waiting for work     | working!      |
3     |                      | spawns thread | alive!
4     |                      | all done      | still alive!
5     |                      | dead?         | still alive?

Seems like launching threads from a resque job that are intended to outlive the forked worker process is likely a bad idea? I will attempt to reconfigure this stuff without spawning threads and see if that changes things.

@lantins
Owner

lantins commented Apr 22, 2014

Thanks for the information, it seems you're figuring it out quite fine by yourself :)

Yes, the threads associated with the worker cannot outlive the parent process.

If the processing (sending the message) happens in another thread/process from the worker it seems sensible the worker won't know anything about any failures, so the plugin would be useless.

Unless the process fails before your thread is spawned, it will never trigger a retry.

I'd suggest sending the message within the same process/thread as the main worker.

If your timeline is correct, the around_perform hook used by resque-retry should still be triggered, so getting a SIGTERM before the child worker's perform completes should still trigger the retry logic. But perhaps the around_perform and perform have completed okay, as far as the worker knows anyhow.

@davetron5000
Contributor Author

OK, more info now. sucker punch, celluloid, and new threads removed. What I'm seeing now is the job getting an exception in what I think is the reconnect call in Resque::Worker (code excerpt here), and it seems like the job is being both retried and sent to the failed queue.

# Resque::Worker#work excerpt
else # this block is the child process
 unregister_signal_handlers if will_fork? && term_child
 begin

   reconnect
   perform(job, &block)  # <=== logs indicate this is not being called

 rescue Exception => exception
   report_failed_job(job,exception)
 end

 # ...

Here's what I see in my logs.

First, we see the job get picked up:

Apr 22 17:54:09 app/fire_item_event_worker.3:  ** [00:54:09 2014-04-23] 2: got: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) 

worker is fire_item_event_worker.3 with a PID of 2.

Next, we see that same worker and PID run the before_fork hooks (of which we don't have any configured anywhere, as far as I can tell):

Apr 22 17:54:09 app/fire_item_event_worker.3:  ** [00:54:09 2014-04-23] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"])] 

The first line of my perform method prints a log message, which we don't see. Instead, we see child pid 632 of our parent worker (fire_item_event_worker.3) die immediately with a SIGTERM:

Apr 22 17:54:10 app/fire_item_event_worker.3:  ** [00:54:10 2014-04-23] 632: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) failed: #<Resque::TermException: SIGTERM> 

The timestamp of this failure is the same as what we see in resque-web:

[resque-web screenshot of the failed job]

I believe this is raised from one of the signal handlers, due to the string "SIGTERM" being the only message:

    # Resque::Worker
    def unregister_signal_handlers
      trap('TERM') do
        trap ('TERM') do 
          # ignore subsequent terms               
        end  
        raise TermException.new("SIGTERM") # <===========
      end 
      trap('INT', 'DEFAULT')

      begin
        trap('QUIT', 'DEFAULT')
        trap('USR1', 'DEFAULT')
        trap('USR2', 'DEFAULT')
      rescue ArgumentError
      end
    end

Around 25 seconds later, we see this (note that the job is configured to retry after 60 seconds):

Job is picked up, this time by a different worker, fire_item_event_worker.1:

Apr 22 17:55:35 app/fire_item_event_worker.1:  ** [00:55:35 2014-04-23] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"])] 
Apr 22 17:55:35 app/fire_item_event_worker.1:  ** [00:55:35 2014-04-23] 2: got: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) 

Job is now fully executed:

Apr 22 17:55:35 app/fire_item_event_worker.1:  ** [00:55:35 2014-04-23] 26452: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"])]  

Here is where the job's perform method takes over (disregard the BACKGROUND messages—those messages are not being run in a thread):

Apr 22 17:55:35 app/fire_item_event_worker.1:  Class(918302,assumed rebinned): attempt 0 of 20 
Apr 22 17:55:36 app/fire_item_event_worker.1:  BACKGROUND START Transmitting Message on sf.merch.style.sku.item.918302.updated -> {:item_id=>918302} 
Apr 22 17:55:37 app/fire_item_event_worker.1:  BACKGROUND END Transmitting Message on sf.merch.style.sku.item.918302.updated -> {:item_id=>918302} 
Apr 22 17:55:37 app/fire_item_event_worker.1:  BACKGROUND AFTER Transmitting Message on sf.merch.style.sku.item.918302.updated -> {:item_id=>918302} 
Apr 22 17:55:37 app/fire_item_event_worker.1:  ** [00:55:37 2014-04-23] 26452: done: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) 

I have verified in our logs that the code that queues this job occurred only once, i.e. there were not two jobs with the same args queued.

So, there are two bits of odd behavior:

  • Job is being retried sooner than expected (and if it were re-queued immediately, there are likely many other jobs of this type in the queue, so a 20 second wait is plausible).
  • The SIGTERM is making it to the failed queue, despite the job being retried.

Job being retried sooner

My only explanation is that somehow, resque-retry is getting a retry_delay of 0 in this code:

      def try_again(exception, *args)
        temp_retry_delay = ([-1, 1].include?(method(:retry_delay).arity) ? retry_delay(exception.class) : retry_delay)

        retry_in_queue = retry_job_delegate ? retry_job_delegate : self
        if temp_retry_delay <= 0 # <==============
          Resque.enqueue(retry_in_queue, *args_for_retry(*args))
        else
          Resque.enqueue_in(temp_retry_delay, retry_in_queue, *args_for_retry(*args))
        end
        # ...
      end

In the console the class returns a method with arity -1, so I don't see how this could happen, but since there's no logging about what's going on, it's hard to be sure. It certainly seems possible that this code could run before @retry_exceptions is set, which would set off a codepath returning 0 for retry_delay.
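
For what it's worth, the check can be restated from a console (values assume the class attributes shown at the top of this issue):

FireItemEventJob.method(:retry_delay).arity  # => -1
FireItemEventJob.retry_delay                 # => 60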

Job being put into the failed queue AND retried

The only way I can explain this would be this code in the failure backend:

      def save
        if !(retryable? && retrying?)
          cleanup_retry_failure_log!
          super
        elsif retry_delay > 0

If, for some reason, retrying? were to return false, meaning that Resque.redis.exists(retry_key) found no such key, then resque-retry's failure backend would delegate to the default one and the job would go to the failed queue.

And this could happen if the before_perform_retry hook wasn't called, which it most likely wasn't, because my theory is that we died before Resque::Worker.perform was called.

Since there's no logging there's no way to know for sure, but this long-winded explanation fits the facts.

Fixing it

What seemed odd about my journey through the code was that we schedule the job for retry in one place (on_failure hook), but we record the fact that that job is being retried in another (before_perform hook).

Is there a reason for this? Or should we move this code from before_perform_retry into on_failure_retry?

retry_key = redis_retry_key(*args)
Resque.redis.setnx(retry_key, -1)             # default to -1 if not set.

@jzaleski
Collaborator

Awesome stuff @davetron5000!

It seems like you may be on the right track. I will give your "Fixing it" suggestion some consideration. In the meantime, if you'd like, add some logging to the gem and submit a pull request (I am sure the community would appreciate it -- I will happily merge it).

@davetron5000
Contributor Author

PR submitted. I'm going to try to run my branch of this in production and see what it tells us.

@lantins
Owner

lantins commented Apr 25, 2014

@davetron5000 Have you tried your suggested fix? If so anything to report?

If all the tests pass with the suggested fix, I see no reason why we can't merge, but I would like to look at this a little closer (perhaps once we've got some data from your extra logging?).

@davetron5000
Contributor Author

Deployed #99 to production yesterday. Unfortunately, I didn't have logging inside the failure backend, so I'm still getting an incomplete picture. Also, the log messages are out of order, but I'm going to blame Papertrail for that. I've rearranged them in the order they were logged.

Job gets picked up:

Apr 24 14:26:48 worker.3:  ** [21:26:47 2014-04-24] 2: got: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) 
Apr 24 14:26:48 worker.3:  ** [21:26:47 2014-04-24] 1211: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])]  
Apr 24 14:26:48 worker.3:  ** [21:26:47 2014-04-24] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])] 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) failed: #<Resque::TermException: SIGTERM> 

We see the new logging from resque-retry:

Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: on_failure_retry 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: Retrying based on exception 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: NOT Retrying based on criteria 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: trying again 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: retry delay: 60 for queue FireItemEventJob  

This is where the story ends, unfortunately. It at least confirms that the logic inside the plugin itself is acting as expected, which blows my theory that the retry delay was becoming 0 through some race condition.

I'm going to deploy my updated fork that has logging in the failure backend and we can see where this left off. Also, it seems that resque-scheduler does some logging, which I will enable—it could be the culprit here, too.

As before, the job is run again a few seconds later:

Apr 24 14:27:55 worker.1:  ** [21:27:55 2014-04-24] 2: got: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) 
Apr 24 14:27:55 worker.1:  ** [21:27:55 2014-04-24] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])] 
Apr 24 14:27:56 worker.1:  ** [21:27:55 2014-04-24] 31442: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])] 
Apr 24 14:27:56 worker.1:  ** [21:27:56 2014-04-24] 31442: [828852, "PI scan", 329, nil]: attempt 0 set in Redis 
Apr 24 14:27:56 worker.1:  ** [21:27:56 2014-04-24] 31442: [828852, "PI scan", 329, nil]: before_perform_retry 
Apr 24 14:27:56 worker.1:  remote_ip[n/a] uuid[n/a] pid[31442] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] Class(828852,PI scan): attempt 0 of 20 
Apr 24 14:27:57 worker.1:  remote_ip[n/a] uuid[n/a] pid[31442] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] BACKGROUND START Transmitting Message on sf.merch.style.sku.item.828852.updated -> {:item_id=>828852} 
Apr 24 14:27:57 worker.1:  remote_ip[n/a] uuid[n/a] pid[31442] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] BACKGROUND AFTER Transmitting Message on sf.merch.style.sku.item.828852.updated -> {:item_id=>828852} 
Apr 24 14:27:57 worker.1:  ** [21:27:57 2014-04-24] 31442: [828852, "PI scan", 329, nil]: after_perform_retry, clearing retry key 
Apr 24 14:27:57 worker.1:  ** [21:27:57 2014-04-24] 31442: [828852, "PI scan", 329, nil]: Clearing retry key 
Apr 24 14:27:57 worker.1:  ** [21:27:57 2014-04-24] 31442: done: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) 

@lantins
Owner

lantins commented Apr 25, 2014

Okay, keep us updated, as I'm sure you will!

@davetron5000
Contributor Author

tl;dr

The plugin assumes that every job's perform method is called in order for it to work. In my case, that's not happening, so the job is being re-queued and put into the failed queue.

I don't know what the solution is, but having try_again check for redis_retry_key would at least prevent the job from being effectively duplicated (sketched at the end of this comment).

The details

OK, I believe my theory is confirmed.

Here's the log from the plugin:

Apr 25 15:26:13 worker.8:  ** [22:26:12 2014-04-25] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [723809, "assumed rebinned"])] 
Apr 25 15:26:13 worker.8:  ** [22:26:12 2014-04-25] 2: got: (Job{fire_item_event} | FireItemEventJob | [723809, "assumed rebinned"]) 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: (Job{fire_item_event} | FireItemEventJob | [723809, "assumed rebinned"]) failed: #<Resque::TermException: SIGTERM> 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: on_failure_retry 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: Exception is sufficient for a retry 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: user retry criteria is not sufficient for a retry 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: try_again 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: retry delay: 60 for class: FireItemEventJob 

And here's the log from the failure back-end:

Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: failure backend save 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: !(true && false) - sending failure to superclass 

The second line is from this code:

def save
  log 'failure backend save', payload, exception

  if !(retryable? && retrying?)
    log "!(#{retryable?} && #{retryable? && retrying?}) - sending failure to superclass", payload, exception
    cleanup_retry_failure_log!
    super

This means that retrying? is false. Its source:

def retrying?
  Resque.redis.exists(retry_key)
end

And retry_key:

def retry_key
  klass.redis_retry_key(*payload['args'])
end

klass is the job itself, so redis_retry_key is a method provided by the plugin.

So, what this tells me is that when the failure back-end is figuring out what to do, the plugin has not yet placed the job's retry information into Redis, so the failure backend figures the job isn't being retried and defers up the chain (which, in my configuration, places the job on the failed queue).

Since perform on my job isn't being called, nor do we see Resque's after_fork hook log message, we can assume the error is happening during reconnect here (from https://github.com/resque/resque/blob/1-x-stable/lib/resque/worker.rb#L189 ):

begin

  reconnect
  perform(job, &block)

rescue Exception => exception
  report_failed_job(job,exception)
end

In that case, report_failed_job gets called before any of the perform hooks are called. Sure enough, the Redis key is set in a before_perform hook (see https://github.com/stitchfix/resque-retry/blob/72ac3568d4bade04c77222183b6e2d2d71dbd8a2/lib/resque/plugins/retry.rb#L337 ):

def before_perform_retry(*args)
  log 'before_perform_retry', args
  @on_failure_retry_hook_already_called = false

  # store number of retry attempts.
  retry_key = redis_retry_key(*args)
  Resque.redis.setnx(retry_key, -1)             # <=======================
  @retry_attempt = Resque.redis.incr(retry_key)
  log "attempt: #{@retry_attempt} set in Redis", args
  @retry_attempt
end

So, back to report_failed_job, it calls Job#fail, which is:

def fail(exception)
  run_failure_hooks(exception)
  Failure.create \
    :payload   => payload,
    :exception => exception,
    :worker    => worker,
    :queue     => queue
end

run_failure_hooks is what calls on_failure_retry in the plugin.

So, it seems that the intended happy path for the plugin would be as follows:

Tick | Method                            | Activity
1    | Worker.work                       |
2    | fork                              |
3    | reconnect                         |
4    | Worker.perform                    |
5    | Job.perform                       |
6    | perform hooks                     |
7    | before_perform_retry              | Sets redis_retry_key
8    | MyJob.perform                     | does its thing
9    |                                   | Exception raised!!
10   | Worker.work                       | rescue that exception
11   | Worker.report_failed_job          |
12   | Job.fail                          |
13   | run_failure_hooks                 |
14   | on_failure_retry                  |
15   | try_again                         | queues the job for later
16   | Job.fail                          | Failure.create
17   | Failure.create                    | MultipleWithRetrySuppression.save
18   | MultipleWithRetrySuppression.save | sees that redis_retry_key is there, so don't bubble up the job to the failed queue

What's happening in my case is this:

Tick | Method                            | Activity
1    | Worker.work                       |
2    | fork                              |
3    | reconnect                         |
4    |                                   | Exception raised!
5    | Worker.work                       | rescue that exception
6    | Worker.report_failed_job          |
7    | Job.fail                          |
8    | run_failure_hooks                 |
9    | on_failure_retry                  |
10   | try_again                         | queues the job for later
11   | Job.fail                          | Failure.create
12   | Failure.create                    | MultipleWithRetrySuppression.save
13   | MultipleWithRetrySuppression.save | doesn't find redis_retry_key so job bubbles up to failed queue, despite having been queued for retry at Tick 10

Whew! All that to say that the retry logic assumes that we get as far as perform, or it doesn't work.

How to solve this is not as clear to me. We can't move the setting of redis_retry_key into on_failure_retry, because it's clear that redis_retry_key needs to be set for every job.
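
For illustration, here is a rough sketch of the kind of change I have in mind (based on the try_again excerpt above, duplicating the setnx from before_perform_retry rather than moving it; not necessarily what the eventual pull request will look like):

def try_again(exception, *args)
  # ensure the retry key exists even if before_perform_retry never ran,
  # so MultipleWithRetrySuppression#save sees retrying? as true
  Resque.redis.setnx(redis_retry_key(*args), -1)

  temp_retry_delay = ([-1, 1].include?(method(:retry_delay).arity) ? retry_delay(exception.class) : retry_delay)
  retry_in_queue = retry_job_delegate ? retry_job_delegate : self
  if temp_retry_delay <= 0
    Resque.enqueue(retry_in_queue, *args_for_retry(*args))
  else
    Resque.enqueue_in(temp_retry_delay, retry_in_queue, *args_for_retry(*args))
  end
  # ...
end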

@jzaleski
Collaborator

jzaleski commented May 2, 2014

@davetron5000 are you working on a fix for this? Should I expect a pull-request? Just trying to ensure that this does not fall off the grid.

@davetron5000
Contributor Author

I am going to submit a pull request for this; I've just been sidetracked. I think you are good to wait until you hear from me.

@jzaleski
Collaborator

jzaleski commented May 2, 2014

Alrighty. Thanks for the update.


@davetron5000
Contributor Author

My proposed fix is in #100.
