-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
logging: Attempt to recover logmon failures #5577
Conversation
7d24952
to
21c3970
Compare
Currently, when logmon fails to reattach, we will retry reattachment to the same pid until the task restart specification is exhausted. Because we cannot clear hook state during error conditions, it is not possible for us to signal to a future restart that it _shouldn't_ attempt to reattach to the plugin. Here we revert to explicitly detecting reattachment seperately from a launch of a new logmon, so we can recover from scenarios where a logmon plugin has failed. This is a net improvement over the current hard failure situation, as it means in the most common case (the pid has gone away), we can recover. Other reattachment failure modes where the plugin may still be running could potentially cause a duplicate process, or a subsequent failure to launch a new plugin. If there was a duplicate process, it could potentially cause duplicate logging. This is better than a production workload outage. If there was a subsequent failure to launch a new plugin, it would fail in the same (retry until restarts are exhausted) as the current failure mode.
21c3970
to
269e2c0
Compare
// We did not reattach to a plugin and one is still not running. | ||
if h.logmonPluginClient == nil || h.logmonPluginClient.Exited() { | ||
if err := h.launchLogMon(nil); err != nil { | ||
// Retry errors launching logmon as logmon may have crashed on start and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh I wrote this comment and I can't make out what it means. Could you rewrite this into something intelligible. We want to convey that subsequent attempts to launch logmon will likely error at this point so a recoverable error is returned so the task can reattempt to start.
Co-Authored-By: notnoop <[email protected]>
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.
Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it shouldn't
attempt to reattach to the plugin.
Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.
This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.
Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.
If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.
If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
A future improvement would be to wrap logmon reattachment failures and
attempt to ensure that the process has been destroyed, however this
would require care in pid-rotation heavy environments and on windows.