
Unreleased locks when Resque terminates using the loner option #26

Closed
tjsousa opened this issue Jan 22, 2015 · 5 comments


tjsousa commented Jan 22, 2015

Hi guys,

I've found an odd behaviour when using the loner option: unreleased locks (with no timeout) can be left stuck in Redis when a process is terminated while Resque is enqueueing a job (a Heroku deploy environment, in our case).

After taking a look at the code, I believe the issue is that the lock is acquired in a before_enqueue hook. During a process termination, that hook can complete successfully while the actual Resque enqueue operation does not (the two are not transactional, as far as I know). After that, no process can enqueue a similar job, because the lock is already in Redis, and subsequent jobs of the same kind are inhibited forever.

I was wondering whether this could be handled with a two-phase process, where an after_enqueue hook acquires the lock only after the actual job enqueue operation has completed.
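To make the suggestion concrete, here is a minimal sketch of the two-phase idea, using an in-memory lock store in place of Redis. All names here (`LockStore`, the hook method bodies) are illustrative, not the plugin's actual API: the point is that the before_enqueue hook only *checks* the lock, and the write happens after the enqueue succeeds, so a crash between the two phases leaves no stale key behind.

```ruby
require 'set'

# Stand-in for the Redis keys the loner option would use.
class LockStore
  def initialize
    @locks = Set.new
  end

  # Phase 1 (before_enqueue): only check, do not write. If the process
  # dies after this point, nothing is left in the store.
  def enqueueable?(key)
    !@locks.include?(key)
  end

  # Phase 2 (after_enqueue): acquire the lock once Resque has actually
  # pushed the job onto the queue.
  def acquire(key)
    @locks.add(key)
  end

  def release(key)
    @locks.delete(key)
  end
end

store = LockStore.new
key = 'loner:MyJob:[42]'

if store.enqueueable?(key)    # before_enqueue hook
  # Resque.enqueue(MyJob, 42) would run here
  store.acquire(key)          # after_enqueue hook
end
```

The trade-off is a small race window: two producers can both pass the phase-1 check before either reaches phase 2, so a duplicate job becomes possible where a stuck lock was possible before.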

lantins (Owner) commented Jan 22, 2015

Hey @tjsousa

I don't use loner myself, so it's difficult for me to judge how this should play out.

When the process is terminated and the lock gets 'stuck', do you know HOW it's killed?

I'd love to see a PR on how this could be fixed :)
Or we can discuss it some more here and figure out a way together.

edjames commented Jul 6, 2015

Hi

See my reply to another issue, which describes pretty much the same behaviour as here: #17

Hope that helps.

tjsousa (Author) commented Jul 6, 2015

Thank you @edjames !

Although I have worked around this problem, it still persists. Looking at #17, my particular scenario does not use a hash as a parameter, so I'm not sure the two share a common cause.

@lantins Answering your question: I know it happens when our app running on Heroku sends KILL signals to our Resque processes, although I wasn't able to pinpoint the exact cause (it only happens sometimes).

lantins (Owner) commented Aug 4, 2015

This reminds me of this issue: lantins/resque-retry#61

tjsousa (Author) commented Sep 7, 2015

@lantins, thanks to that link I was finally able to dedicate some time to replicating the issue and, in fact, confirmed that the lock is not removed in the case of a dirty exit (e.g. a SIGKILL sent to the child job).

Using a similar approach, we can do the lock cleanup from the worker process through an on_failure hook.
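A rough sketch of that cleanup, in the spirit of how resque-retry handles its retry keys. The key format and method names below are hypothetical, and a plain hash stands in for Redis; the real plugin would derive the lock key from the job class and its arguments. The premise (which matches what was observed above) is that on a dirty exit the parent worker records the failure and runs the job's on_failure hooks, so the lock can be released from the worker side even though the enqueue-side release never ran.

```ruby
# Stand-in for the Redis connection.
FAKE_REDIS = {}

module LonerCleanup
  # Hypothetical key derivation: the real plugin computes this from the
  # job class name and its arguments.
  def loner_lock_key(*args)
    "loner:lock:#{name}:#{args.inspect}"
  end

  # on_failure hooks run in the worker when the job fails, including
  # (via Resque's DirtyExit handling) when the child is killed, so the
  # stale lock is removed here.
  def on_failure_release_loner_lock(_exception, *args)
    FAKE_REDIS.delete(loner_lock_key(*args))
  end
end

class MyJob
  extend LonerCleanup
  @queue = :default

  def self.perform(*args)
    raise 'boom'
  end
end

# Simulate: lock set at enqueue time, job dies, hook cleans up.
key = MyJob.loner_lock_key(42)
FAKE_REDIS[key] = true
MyJob.on_failure_release_loner_lock(RuntimeError.new('boom'), 42)
```

One caveat worth checking before relying on this: the hook only fires if the *worker* survives to report the failure; if the whole dyno is killed at once, a lock timeout is still the safety net.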
