I'm seeing fairly frequent instances in which a job appears to be running for 30 minutes -- but, as far as I can tell, is not.
One of the job types where I see this records a metric to DataDog when it completes successfully, and that metric never exceeds 90 seconds or so. I am seeing some job failures here and there, but those also happen well under the 30-minute mark.
Instead, I suspect this is related to another problem we're having: we're seeing bursts of dial errors when workers and other clients try (and fail) to connect to Faktory. What would happen if a job completed (success or failure), but the worker was unable to report that status back to the server because it couldn't get a connection?
Looking at `faktory_worker_go` (https://github.com/contribsys/faktory_worker_go/blob/master/runner.go#L260), it appears that `.with` can return an error, but that this error is routinely ignored.
I don't think calling `panic` is appropriate for this situation, but perhaps some combination of the following might be helpful:
- Getting and holding a connection both to retrieve the job and to report the result (and maybe, as a side effect, making the connection available to the worker via context so it can fire jobs without having to establish its own connection).
- Introducing a pluggable logger mechanism so we can at least record that these failures are happening (a minimal interface sketch follows this list).
- Having some sort of retry loop, specifically around reporting the results of a job (see the retry sketch after this list).
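On the logger idea, here's a minimal sketch of what I mean. This assumes nothing about the library's internals; the interface name and the stdlib default are made up:

```go
// Hypothetical pluggable-logger interface (name and shape are my own):
// the runner would accept anything that satisfies it and default to the
// stdlib log package, so a dropped ACK/FAIL at least leaves a trace.
package main

import "log"

type Logger interface {
	Printf(format string, args ...interface{})
}

// defaultLogger falls back to the standard library logger.
var defaultLogger Logger = log.Default()

func main() {
	// Placeholder JID and error text, purely for illustration.
	defaultLogger.Printf("faktory: failed to report job %s: %v",
		"some-jid", "dial tcp: connection refused")
}
```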
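And on the retry idea, a rough sketch of the shape I have in mind. This is not the library's actual code: `reportWithRetry`, the attempt count, and the backoff are all my own assumptions, but `faktory.Open`, `Ack`, and `Fail` are the standard Go client calls.

```go
// reportWithRetry is a hypothetical helper (not part of faktory_worker_go):
// it retries ACK/FAIL a few times, opening a fresh connection per attempt,
// so a transient dial error doesn't leave the job "running" until the
// reservation expires.
package main

import (
	"fmt"
	"log"
	"time"

	faktory "github.com/contribsys/faktory/client"
)

func reportWithRetry(jid string, jobErr error, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		cl, err := faktory.Open() // honors FAKTORY_URL / FAKTORY_PROVIDER
		if err != nil {
			lastErr = err
		} else {
			if jobErr == nil {
				err = cl.Ack(jid)
			} else {
				err = cl.Fail(jid, jobErr, nil)
			}
			cl.Close()
			if err == nil {
				return nil
			}
			lastErr = err
		}
		// Crude linear backoff between attempts; tune to taste.
		time.Sleep(time.Duration(i+1) * time.Second)
	}
	return fmt.Errorf("could not report result for %s after %d attempts: %w",
		jid, attempts, lastErr)
}

func main() {
	// Usage sketch with a placeholder JID: after the handler returns,
	// report its outcome and surface any failure instead of dropping it.
	if err := reportWithRetry("some-jid", nil, 3); err != nil {
		log.Printf("faktory: %v", err)
	}
}
```

Even capping this at two or three attempts with a short sleep would cover the burst-of-dial-errors case we're seeing.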
The first option (holding a connection) would have some risks/challenges of its own, of course. You'd need to ensure the connection didn't time out, handle reconnecting if it did go away (either due to a timeout or a server failure), etc. I'm sure you have more insight into how it would impact operational concerns in general, so forgive me if it's an Obviously Stupid Idea. That said, for situations involving relatively high job volume (mid-hundreds to low-thousands of jobs per second), the many-transient-connections approach has proven to be a bit of a challenge (I'm paying attention to #219 / #222 for this reason, and we've had to be careful about tuning things like `FIN_WAIT` and such in our server configuration).
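For completeness, here's roughly what I'm imagining for the held-connection option. Again, just a sketch under my own assumptions: the `heldClient` wrapper and its re-dial-once policy are made up; only `faktory.Open`, `Ack`, and `Close` come from the client library.

```go
// heldClient is a hypothetical wrapper around a single long-lived connection:
// it re-dials lazily whenever an operation fails, which is one way to handle
// the timeout/reconnect concerns mentioned above without opening a new
// connection per job.
package main

import (
	"sync"

	faktory "github.com/contribsys/faktory/client"
)

type heldClient struct {
	mu sync.Mutex
	cl *faktory.Client
}

// do runs op against the held connection, re-dialing once if the first
// attempt fails (e.g. the server closed an idle connection).
func (h *heldClient) do(op func(*faktory.Client) error) error {
	h.mu.Lock()
	defer h.mu.Unlock()

	if h.cl == nil {
		cl, err := faktory.Open()
		if err != nil {
			return err
		}
		h.cl = cl
	}
	if err := op(h.cl); err != nil {
		// Assume the connection is bad: drop it and retry once on a fresh one.
		h.cl.Close()
		h.cl = nil

		cl, err2 := faktory.Open()
		if err2 != nil {
			return err2
		}
		h.cl = cl
		return op(h.cl)
	}
	return nil
}

func main() {
	h := &heldClient{}
	// e.g. report a (placeholder) job result over the held connection.
	_ = h.do(func(c *faktory.Client) error { return c.Ack("some-jid") })
}
```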