Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add INFO task even log line and make logmon less noisy #15842

Merged
merged 3 commits into from
Jan 20, 2023

Conversation

schmichael
Copy link
Member

Fixes #15840

While this makes Nomad client agent logging noisier, I think it adds a lot of value. See #15840 for details.

Starting and stopping a task before this change (5 lines, 1011 bytes w/timestamps):

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad-1.4.3: opening fifo: alloc_id=701624a4-d78d-c5e7-bd17-47ce170d65ce task=sleepy @module=logmon path=/tmp/NomadClient582607065/701624a4-d78d-c5e7-bd17-47ce170d65ce/alloc/logs/.sleepy.stdout.fifo timestamp=2023-01-20T11:35:20.575-0800
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad-1.4.3: opening fifo: alloc_id=701624a4-d78d-c5e7-bd17-47ce170d65ce task=sleepy @module=logmon path=/tmp/NomadClient582607065/701624a4-d78d-c5e7-bd17-47ce170d65ce/alloc/logs/.sleepy.stderr.fifo timestamp=2023-01-20T11:35:20.575-0800
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[INFO]  client.gc: marking allocation for GC: alloc_id=701624a4-d78d-c5e7-bd17-47ce170d65ce

Starting and stopping a task after these changes (13 lines, 2423 bytes w/timestamps):

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy type=Started msg="Task started by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
[INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=30704
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy type=Killed msg="Task successfully killed" failed=false
[INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9 task=sleepy path=/home/schmichael/go/bin/nomad pid=30688
[INFO]  client.gc: marking allocation for GC: alloc_id=768601e2-8140-1cf1-55a6-0120282354b9

(commits also include log snippets for each individual change)

Fixes #15840

Example INFO level client logs with this enabled:

```
[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stdout.fifo @module=logmon timestamp=2023-01-20T11:19:34.275-0800
[INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy @module=logmon path=/tmp/NomadClient2414238708/b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51/alloc/logs/.sleepy.stderr.fifo timestamp=2023-01-20T11:19:34.275-0800
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Started msg="Task started by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
[INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=27668
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy type=Killed msg="Task successfully killed" failed=false
[INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51 task=sleepy path=/home/schmichael/go/bin/nomad pid=27653
[INFO]  client.gc: marking allocation for GC: alloc_id=b3dab5a9-91fd-da9a-ae89-ef7f1eceaf51
```

So task events will approximately *double* the number of per-task log
lines, but I think they add a lot of value.
Cannot imagine why users care and removes 2 log lines per task
invocation.

```

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type="Task Setup" msg="Building Task Directory" failed=false
<<< 2 "opening fifo" lines elided here >>>
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1cafb2dc-302e-2c92-7845-f56618bc8648 task=sleepy type=Started msg="Task started by client" failed=false
```
@lindleywhite
Copy link
Contributor

👍🏻 💯

Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I wonder if it would be easier to read if the event.DisplayMessage was the log message itself, like:

tr.logger.Info(event.DisplayMessage, "type", event.Type, "failed", event.FailsTask)

but maybe having a more structured log output is better? Not sure 🤷

@tgross
Copy link
Member

tgross commented Jan 20, 2023

I wonder if it would be easier to read if the event.DisplayMessage was the log message itself, like:

Sometimes those display messages can become long rambling things if they're picking up third-party error messages like in CSI and CNI plugins. Having the "Task event" prefix and some structure will make it easier to find just the task events (or perhaps more importantly, filter them all out if you're looking for something else!).

@schmichael
Copy link
Member Author

schmichael commented Jan 20, 2023

but maybe having a more structured log output is better?

Here's what that looks like:

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner: Task received by client: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy type=Received failed=false
[INFO]  client.alloc_runner.task_runner: Building Task Directory: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy type="Task Setup" failed=false
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner: Task started by client: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy type=Started failed=false
[INFO]  client.alloc_runner.task_runner: Sent interrupt. Waiting 5s before force killing: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy type=Killing failed=false
[INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=36362
[INFO]  client.alloc_runner.task_runner: Exit Code: 130, Signal: 2: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy type=Terminated failed=false
[INFO]  client.alloc_runner.task_runner: Task successfully killed: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy type=Killed failed=false
[INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5 task=sleepy path=/home/schmichael/go/bin/nomad pid=36345
[INFO]  client.gc: marking allocation for GC: alloc_id=aab31e37-6147-8a44-273e-6fcbe562aee5

I think they're a bit confusing without more context as to what they are. The type and failed fields seem really random too. Why would Task started by client ever have failed=true?! ...well obviously (to us) it wouldn't but to know that you have to know that this is actually a Task Event and every Task Event specifies a type and whether it fails the task.

So while it's not totally idiomatic for the rest of our logging, I kind of like making the log line message be a kind of "Type" for the rest of the log line.

@lgfa29
Copy link
Contributor

lgfa29 commented Jan 20, 2023

Another option would be to use a named log for logging task events to make the source more clear, something like:

tr.logger.Named("task_event").Info(event.DisplayMessage, "type", event.Type, "failed", event.FailsTask)

But that's not a hill I would ever die on 😄

@schmichael
Copy link
Member Author

Another option would be to use a named log for logging task events to make the source more clear, something like:

Good point, here's what that looks like:

[INFO]  client: node registration complete
[INFO]  client.alloc_runner.task_runner.task_event: Task received by client: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy type=Received failed=false
[INFO]  client.alloc_runner.task_runner.task_event: Building Task Directory: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy type="Task Setup" failed=false
[WARN]  client.alloc_runner.task_runner.task_hook.logmon: plugin configured with a nil SecureConfig: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy
[INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c sleep 1000]}"
[WARN]  client.driver_mgr.raw_exec.executor: plugin configured with a nil SecureConfig: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 driver=raw_exec task_name=sleepy
[INFO]  client.alloc_runner.task_runner.task_event: Task started by client: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy type=Started failed=false
[INFO]  client.alloc_runner.task_runner.task_event: Sent interrupt. Waiting 5s before force killing: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy type=Killing failed=false
[INFO]  client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 driver=raw_exec task_name=sleepy path=/home/schmichael/go/bin/nomad pid=41104
[INFO]  client.alloc_runner.task_runner.task_event: Exit Code: 130, Signal: 2: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy type=Terminated failed=false
[INFO]  client.alloc_runner.task_runner.task_event: Task successfully killed: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy type=Killed failed=false
[INFO]  client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6 task=sleepy path=/home/schmichael/go/bin/nomad pid=41085
[INFO]  client.gc: marking allocation for GC: alloc_id=2d02071c-1a88-6f2e-b44b-eb4013f302e6

@schmichael
Copy link
Member Author

schmichael commented Jan 20, 2023

I think I still prefer Task event. I'm very biased against our named loggers. They're way more verbose than I'd like and make our output tremendously jagged. The _runners and _mgrs add nothing but noise for end users, and I'm afraid they might be trained to ignore the component hierarchy completely (outside of "client" vs "server").

Going to leave this up for a minute to noodle on or gather more comments as I feel like we might be referring to this thread for future guidance. 😅

(To be clear I'm allowed to be biased against them because IIRC they're my fault.)

@lgfa29
Copy link
Contributor

lgfa29 commented Jan 20, 2023

I think the code is good as is, it just looked a bit weird to me at first to have several log messages that looked the same.

If the log in ingested in a system that hides the extra fields by default (I think Datadog does it? Or maybe it's CloudWatch? Can't remember 😅) they would all look the same.

@schmichael
Copy link
Member Author

k, I'm merging as-is then. If I'm going to regret a decision I want to at least make a new regret (static Task event log line) instead of an old one (deeply nested overly verbose component heirarchy).

@schmichael schmichael merged commit 43fa28a into main Jan 20, 2023
@schmichael schmichael deleted the f-task-event-logging branch January 20, 2023 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad Clients should log Task Events
4 participants