Common Problem
On most teams I've been on, we run into a situation similar to this:
We will have jobs that fail and end up in some form of dead-letter queue (DLQ)
Even with automated retries, the more third-party dependencies we add, the more likely this becomes
We are aware of strategies to mitigate (some of) these issues, but we often deprioritize them and accept the operational load of an engineer manually retrying or discarding the job
Common Solution: Runbook
The team creates something like a runbook with details on why a job could be failing, an investigation to perform, and details around what operational actions to perform.
While this solution works, I've always imagined something more deeply embedded in the UI where an engineer is handling failing jobs.
Possible Solution: ActiveJob Extension + Dashboard UI
Extend ActiveJob to allow Markdown or HTML to be attached to a job via a class method
Render this on the "show page" of a discarded job (or somewhere similar) to inform an engineer of the steps they should take before retrying or deleting the job
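To make the idea concrete, here is a minimal sketch of what the class-method side could look like. Everything in it is an assumption for illustration: `Runbookable`, the `runbook` class method, and `SyncInvoiceJob` are hypothetical names, not part of ActiveJob or good_job. It is written as plain Ruby so it runs on its own; in a Rails app the module would presumably be included in `ApplicationJob` instead.

```ruby
# Hypothetical mixin: none of these names come from ActiveJob or good_job.
module Runbookable
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # With an argument, stores the runbook text on the class.
    # Without one, returns the stored text, falling back to the parent class.
    def runbook(markdown = nil)
      @runbook = markdown.strip if markdown
      @runbook || (superclass.runbook if superclass.respond_to?(:runbook))
    end
  end
end

# Stand-in for a real job; in a Rails app this would inherit from ApplicationJob.
class SyncInvoiceJob
  include Runbookable

  runbook <<~MARKDOWN
    ## Before retrying
    1. Check whether the third-party API is reporting an outage.
    2. If the invoice was already synced by hand, discard the job instead.
  MARKDOWN
end

# A dashboard "show page" could look up the failed job's class and render this text.
puts SyncInvoiceJob.runbook
```

Keeping the text next to the job code means the runbook is reviewed and versioned alongside the job itself, which is one possible advantage over a separate wiki page.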
Dream Solution
When I look at tools like runbook, it makes me excited about a UI version of this that is job-specific.
A few ideas:
Each job can have a customized "runbook", which could include instructions and buttons/inputs that follow a logical flow similar to the runbook gem, so that someone can run through a full investigation purely through the UI
When a job is discarded, some commands could be started automatically to add metadata to the failure
This would require rate limiting or other caps to avoid problems when thousands of jobs are failing
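The automatic-commands idea and its limit could be sketched roughly as follows. This is only an illustration under stated assumptions: `DiagnosticLimiter` is a hypothetical helper, it uses a simple fixed-window counter, and the diagnostic block is a placeholder for whatever check a job would actually run when it is discarded.

```ruby
# Hypothetical helper: caps how many automatic diagnostics run per time window.
class DiagnosticLimiter
  def initialize(max_per_window:, window_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max = max_per_window
    @window = window_seconds
    @clock = clock
    @window_start = nil
    @count = 0
  end

  # Runs the block and returns its result, or returns nil once the
  # budget for the current window has been spent.
  def run
    now = @clock.call
    if @window_start.nil? || now - @window_start >= @window
      @window_start = now
      @count = 0
    end
    return nil if @count >= @max

    @count += 1
    yield
  end
end

limiter = DiagnosticLimiter.new(max_per_window: 2, window_seconds: 60)

# Simulate five jobs being discarded in quick succession.
results = 5.times.map do |i|
  limiter.run { { job: i, upstream_status: "checked" } }
end

# Only the first two failures get diagnostic metadata; the rest are skipped.
puts results.compact.size
```

This version is not thread-safe and keeps its counter in memory, so a real implementation would likely need shared state (for example in the database) to hold the limit across worker processes.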
Misc Thoughts
It is possible that good_job is not the right place for this sort of proposal
I would love to hear if anyone else has encountered this problem in general and what tooling they use (or would like to see) to help their teams.