run-once guarantee? #151

jrochkind · 2020-09-21T18:44:01Z

I was interested in the discussion at #65, which I think got me understanding the intent of the advisory lock use here. Regarding @ukd1's question about transactions vs sessions... I'm not sure, but I could believe holding open a transaction for 15 minutes has performance implications that simply holding open a session does not. I believe long-term open connections/sessions are an expected usage pattern for postgres (or any db), but long-term transactions being open are not.

However, thinking through this led me to wonder if I had an example backing up my questioning of the README statement:

GoodJob guarantees that a completely-performed job will run once and only once

In general, this is a hard thing to guarantee in concurrent systems. Most systems don't try to guarantee "exactly once", but something more like "always run at LEAST once, USUALLY run only once, occasionally in some edge cases it might run more than once." Which is one reason it's often said it's important for your ActiveJobs to be idempotent -- so it's okay if they run more than once. (For instance, here's sidekiq on it)

Since it's so rare to make the "run exactly once" claim, I was a a bit suspicious of it. If i understand how advisory locks are being used.... is this an example showing how it can fail and wind up more than once? --

What if the network connecction to postgres gets interupted in minute 7 of the 15 minute job. The session-level advisory lock would be dropped, and another worker free to take the job, correct? However, the original job is still running -- and may even be able to save it's work to the database, because if it was an intermittent network interuption, ActiveRecord in some cases may transparently reestablish the connection again.

Is this a case where a job might be run twice, completely-performed? You do hedge with "completely-performed", but I think it could still be. Although if a job were run once 99% performed, and then again 100% performed... I'm not sure it actually matters that much that it wasn't "completely-performed" twice?

I may have the details of this potential problem not quite right. But I'm suspicious of the "Exactly right" guarantee, and wonder how confident you really are this can be guaranteed even with unexpected infrastructure failures like intermittent network interuption etc. I don't think any other systems, including sidekiq, make a "exactly once" guarantee -- it may be safer for everyone to realize that a job can in some cases be run more than once and must be idempotent.

bensheldon · 2020-09-21T19:07:13Z

@jrochkind thanks for the additional context on this 🙏 It was discussed in #59 which might be a good place to continue the conversation.

jrochkind · 2020-09-21T19:19:18Z

ok thanks. I could copy this discussion over to that closed issue, would you like me to do that?

I remain suspicious of the "guarantee that a completely-performed job will run once and only once", I'm not sure I really believe the "only once" can be guaranteed.

I think I have an example where it might run more than once, as above. But am I mistaken? In general, you are confident that good_job really can guarantee "completely-performed only once", never more than once, regardless of what errors may happen?

bensheldon · 2020-09-21T20:00:36Z

You're welcome to copy it over.

I think the phrase is open to interpretation. I originally had a strict statement close to what you're expecting ("at least once"), but that also caused confusion. It looks like that during a recent documentation overhaul there was an additional statement that was dropped: "Writing reliable, transactional, and idempotent ActiveJob#perform methods is outside the scope of GoodJob." Do you think re-adding that would address your questions?

jrochkind · 2020-09-21T21:08:50Z

So honestly I'm not sure the "completely-performed" is clear what it means, or if the distinction is relevant once explained.

I think it's probably really "at least once", not "exactly once".

Sidekiq says:

Remember that Sidekiq will run your jobs AT LEAST once so they need to be idempotent.

In the context of an example, that I'm not sure is analagous here or not:

By default, Sidekiq 6+ gives workers 25 seconds to shut down. (Run Sidekiq <6 with -t 25.) This is carefully chosen because Heroku and AWS ECS give a process 30 seconds to shutdown before killing it. After 25 seconds, any remaining jobs still in progress are pushed back onto Redis so they can be immediately restarted when Sidekiq starts back up. Remember that Sidekiq will run your jobs AT LEAST once so they need to be idempotent. This is one example of how a job can be run twice.

https://github.com/mperham/sidekiq/wiki/FAQ

Maybe (?) something more like:

"GoodJob will normally run each enqueued job once. GoodJob guarantees that regardless of error conditions all jobs will be run at least once, they won't be dropped. As with most other job queue systems, In unusual circumstances a job can be run more than once so it is important jobs are written to be idempotent."

bensheldon closed this as completed Sep 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

run-once guarantee? #151

run-once guarantee? #151

jrochkind commented Sep 21, 2020

bensheldon commented Sep 21, 2020

jrochkind commented Sep 21, 2020

bensheldon commented Sep 21, 2020

jrochkind commented Sep 21, 2020

run-once guarantee? #151

run-once guarantee? #151

Comments

jrochkind commented Sep 21, 2020

bensheldon commented Sep 21, 2020

jrochkind commented Sep 21, 2020

bensheldon commented Sep 21, 2020

jrochkind commented Sep 21, 2020