exception_executions not counted correctly? #215

Closed
Timmitry opened this issue Feb 12, 2021 · 5 comments


@Timmitry
Contributor

Hello, and thank you for this great gem! I'm using it for the first time, but am already very satisfied, as the integration with ActiveJob feels very good 👌

I was just testing my application in production. I have a clockwork process which enqueues a DummyJob, which is then executed by a separate good_job process. The dummy job looks like this:

class DummyJob < ApplicationJob
  retry_on KeyError, wait: :exponentially_longer, attempts: 3

  def perform
    sleep(10)
    {}.fetch(:foo) if rand > 0.2
  end
end

I.e., it fails randomly with a KeyError and is retried up to 3 times (for this error).

Now I was watching the good_job dashboard, and saw the three executions of the job (newest first):

37cf4067-1678-4de3-85a5-93ae53325981 	be09bce1-16d7-4614-9f08-92a97db36e18 	DummyJob 	default 	2021-02-12 13:22:56 UTC 	KeyError: key not found: :foo 	
{
  "job_id": "be09bce1-16d7-4614-9f08-92a97db36e18",
  "locale": "en",
  "priority": null,
  "timezone": "UTC",
  "arguments": [],
  "job_class": "DummyJob",
  "executions": 2,
  "queue_name": "default",
  "enqueued_at": "2021-02-12T13:22:36Z",
  "provider_job_id": "1a01e6c5-973a-443c-86e0-7c8a274d49ef",
  "exception_executions": {
    "[KeyError]": 12
  }
}

1a01e6c5-973a-443c-86e0-7c8a274d49ef 	be09bce1-16d7-4614-9f08-92a97db36e18 	DummyJob 	default 	2021-02-12 13:22:26 UTC 	KeyError: key not found: :foo 	
{
  "job_id": "be09bce1-16d7-4614-9f08-92a97db36e18",
  "locale": "en",
  "priority": null,
  "timezone": "UTC",
  "arguments": [],
  "job_class": "DummyJob",
  "executions": 1,
  "queue_name": "default",
  "enqueued_at": "2021-02-12T13:22:23Z",
  "provider_job_id": "82974df4-b046-4a9b-9a80-625eb1060144",
  "exception_executions": {
    "[KeyError]": 2
  }
}

82974df4-b046-4a9b-9a80-625eb1060144 	be09bce1-16d7-4614-9f08-92a97db36e18 	DummyJob 	default 	2021-02-12 13:22:12 UTC 	KeyError: key not found: :foo 	
{
  "job_id": "be09bce1-16d7-4614-9f08-92a97db36e18",
  "locale": "en",
  "priority": null,
  "timezone": "UTC",
  "arguments": [],
  "job_class": "DummyJob",
  "executions": 0,
  "queue_name": "default",
  "enqueued_at": "2021-02-12T13:22:12Z",
  "provider_job_id": null,
  "exception_executions": {
    "[KeyError]": 1
  }
}

As you can see, the executions go from 0 to 2, which seems correct. The exception_executions for KeyError, however, go from 1 to 2 to 12? For the next dummy job, which was also retried 3 times, I had 1, 2, and 20.

I'm not sure if I misunderstand some behaviour, or if this might be an ActiveJob bug, but I thought I'd ask here first.

Thank you!

@bensheldon
Owner

bensheldon commented Feb 12, 2021

@Timmitry Great question! I think you're experiencing the default behavior:

  • ActiveJob's retry_on, after the number of attempts is exceeded, will re-raise the exception up to the adapter (GoodJob).
  • GoodJob, by default, will re-run a job when an exception is raised during job execution.

You can change this in either of two ways (a rough sketch follows below):

  • Use a block with your retry_on, which is called instead of re-raising the exception once the attempts are exceeded: retry_on(KeyError, wait: :exponentially_longer, attempts: 3) { |_job, _error| nil }
  • Change the GoodJob default so jobs are not re-run when an unhandled exception is raised: GoodJob.retry_on_unhandled_error = false
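
For example, the block form might look roughly like this (an untested sketch; the block receives the job and the error):

retry_on KeyError, wait: :exponentially_longer, attempts: 3 do |_job, error|
  # Called instead of re-raising once the attempts are exceeded, so the
  # exception never reaches GoodJob and the job is not re-run
  Rails.logger.warn("DummyJob giving up: #{error.message}")
end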

Please take a look at this section in the Readme. Any ideas on how to better document/explain this behavior?

https://github.com/bensheldon/good_job#retries

@Timmitry
Contributor Author

Hey @bensheldon, thanks for the quick answer - I'll be glad to help improve the documentation, but first, some more questions 🙂

Let's start with the following setup for a dummy job:

# initializer
GoodJob.on_thread_error = ->(exception) do
  binding.pry # on_thread_error
  Raven.capture_exception(exception)
end
GoodJob.preserve_job_records = true


# dummy_job.rb
class DummyJob < ApplicationJob
  around_perform do |_job, block|
    block.call
  rescue StandardError => e
    Raven.capture_exception(e)
    binding.pry # around_perform
    raise
  end

  retry_on KeyError, wait: 1.second, attempts: 3 do |_job, _exception|
    binding.pry # retry_on
    raise _exception
  end

  def perform
    binding.pry # perform
    {}.fetch(:foo)
  end
end

Now, if I enqueue one dummy job, I'll hit the binding.pry breakpoints in the following order:

perform -> around_perform (Execution 1)
perform -> around_perform (Execution 2)
perform -> around_perform (Execution 3)
retry_on (we specified attempts: 3 and the job failed 3 times, so as far as ActiveJob is concerned, the job has failed)
retry_on (Why is this being called again?)

And from then on, it always goes like this, in a seemingly infinite loop (at least until exception_executions was up to 40, at which point I stopped trying):
perform -> around_perform -> retry_on -> retry_on

If I call (_job.)exception_executions at the various binding.pry breakpoints, it is always an even value in perform and around_perform, i.e. it increases by 2 between executions, but it increases by 1 in the retry_on block.

The on_thread_error is only being reached when I use !!! to exit the binding.pry.

It seems like GoodJob retries infinitely when GoodJob.retry_on_unhandled_error has not been set (i.e. it defaults to true); in production, however, it seems to have stopped randomly at 12 or 20 executions.

Maybe you can shed some light on this behaviour?

Overall, my current thinking is that retry_on_unhandled_error should always be set to false, and that people should use ActiveJob's retry mechanism - especially as I don't see a way to specify a wait time between retries in GoodJob's retry handling.
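
A minimal sketch of what I mean, reusing the pieces from above (untested):

# config/initializers/good_job.rb
GoodJob.retry_on_unhandled_error = false # ActiveJob alone decides whether to retry
GoodJob.preserve_job_records = true

# dummy_job.rb
class DummyJob < ApplicationJob
  # Both the number of attempts and the wait between them are specified here, in ActiveJob
  retry_on KeyError, wait: :exponentially_longer, attempts: 3

  def perform
    {}.fetch(:foo)
  end
end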

@bensheldon
Copy link
Owner

@Timmitry I'll dig more into the examples you gave. I just wanted to respond to this:

Overall, my current thinking is that retry_on_unhandled_error should always be set to false, and that people should use ActiveJob's retry mechanism - especially as I don't see a way to specify a wait time between retries in GoodJob's retry handling.

I largely agree with that. But as a default, I'm opinionated that GoodJob should preserve jobs on unhandled exceptions. That's especially true if someone is migrating from another job adapter and likely isn't using ActiveJob for error handling at the time they swap in GoodJob. Safe by default is my opinion here, even if it's a bit rough.

@bensheldon
Owner

Thanks for documenting all of these things. There's a lot to cover, so let me know if you want me to dive deeper into something:

  • I'm not able to reproduce the even/odd execution behavior or the stopping at 12 or 20 executions (we can hop on a Zoom and try some joint debugging). There should be infinite retries when GoodJob.retry_on_unhandled_error = true.
  • Seeing the exception_executions counts increase after the retry_on attempts: are exceeded is a bug. In GoodJob's current model, the job params should be immutable, and that value should not change at all when an exception is raised up to GoodJob; GoodJob will re-execute the job but shouldn't modify any of its params. This is fixed in "Ensure Job#serialized_params are immutable" (#218).
  • I was able to reproduce that on_thread_error is not called when an exception bubbles up to GoodJob. I think that's intentional (because the exception is caught by GoodJob), but I'm thinking that maybe it should trigger on any job errors that bubble up.

@bensheldon
Owner

@Timmitry I am going to close this issue because I believe all of the problems have either been addressed or have targeted issues.
