logging: Attempt to recover logmon failures #5577

Merged
merged 2 commits on Apr 22, 2019

Conversation

endocrimes
Contributor

Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it shouldn't
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment separately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same way (retry until restarts are exhausted) as the current failure
mode.

A future improvement would be to wrap logmon reattachment failures and
attempt to ensure that the process has been destroyed; however, this
would require care in pid-rotation-heavy environments and on Windows.
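
A minimal sketch of the recovery pattern described above, assuming hypothetical reattachLogmon and launchLogmon helpers rather than the actual Nomad hook code: a failed reattach (most commonly because the pid has gone away) is logged and falls through to launching a fresh logmon instead of surfacing a hard error.

package main

import (
	"errors"
	"fmt"
	"log"
)

// logmonClient stands in for the plugin client handle; the pid field is
// illustrative only.
type logmonClient struct{ pid int }

// reattachLogmon simulates reattaching to a previously launched plugin. In
// the most common failure case the pid has gone away, so it returns an error.
func reattachLogmon(pid int) (*logmonClient, error) {
	return nil, errors.New("no plugin found for reattach pid")
}

// launchLogmon simulates starting a fresh logmon plugin.
func launchLogmon() (*logmonClient, error) {
	return &logmonClient{pid: 4321}, nil
}

// startOrRecoverLogmon tries to reattach to an existing logmon and, if that
// fails, launches a new one instead of propagating the reattach error.
func startOrRecoverLogmon(previousPid int) (*logmonClient, error) {
	if previousPid != 0 {
		c, err := reattachLogmon(previousPid)
		if err == nil {
			return c, nil
		}
		log.Printf("failed to reattach to logmon (pid %d), launching a new one: %v", previousPid, err)
	}
	return launchLogmon()
}

func main() {
	c, err := startOrRecoverLogmon(1234)
	if err != nil {
		log.Fatalf("launching logmon failed: %v", err)
	}
	fmt.Printf("logmon running with pid %d\n", c.pid)
}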

@endocrimes endocrimes requested a review from schmichael April 18, 2019 11:19
@endocrimes endocrimes force-pushed the dani/b-logmon-unrecoverable branch from 7d24952 to 21c3970 on April 18, 2019 11:41
@endocrimes endocrimes force-pushed the dani/b-logmon-unrecoverable branch from 21c3970 to 269e2c0 on April 18, 2019 11:42
@schmichael schmichael changed the title loggging: Attempt to recover logmon failures logging: Attempt to recover logmon failures Apr 18, 2019
client/allocrunner/taskrunner/logmon_hook.go (outdated review thread)
// We did not reattach to a plugin and one is still not running.
if h.logmonPluginClient == nil || h.logmonPluginClient.Exited() {
if err := h.launchLogMon(nil); err != nil {
// Retry errors launching logmon as logmon may have crashed on start and
Member
Ugh I wrote this comment and I can't make out what it means. Could you rewrite this into something intelligible? We want to convey that subsequent attempts to launch logmon will likely error at this point, so a recoverable error is returned and the task can reattempt to start.
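
One way to phrase what that comment is trying to say, as a hedged sketch rather than the code that was merged: make it explicit that subsequent launch attempts will likely also fail, and return a recoverable error so the task restart path retries the launch. The recoverableError type and launchLogmonHook wrapper below are hypothetical stand-ins, not Nomad's actual error machinery.

package main

import "fmt"

// recoverableError marks an error that the task restart logic is allowed to
// retry; it is only a stand-in for this sketch.
type recoverableError struct{ err error }

func (r *recoverableError) Error() string { return r.err.Error() }

// launchLogmonHook wraps a logmon launch attempt. If the launch fails,
// retrying immediately from inside the hook would likely fail again (logmon
// may have crashed on start), so the failure is returned as a recoverable
// error and the task restart path retries later.
func launchLogmonHook(launch func() error) error {
	if err := launch(); err != nil {
		return &recoverableError{err: fmt.Errorf("failed to launch logmon: %w", err)}
	}
	return nil
}

func main() {
	err := launchLogmonHook(func() error { return fmt.Errorf("logmon exited during startup") })
	fmt.Println(err)
}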

@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 12, 2023