
bulkio: provide visibility for events in debug logs #45643

Closed
petermattis opened this issue Mar 3, 2020 · 8 comments · Fixed by #57990
Labels: A-logging (In and around the logging infrastructure), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-disaster-recovery

Comments

@petermattis (Collaborator)

Customer request: provide visibility for cluster level events in the logs. We're already recording these cluster level events in the system.events table, and exposing them in the UI, but the customer wants to tail logs. Their ideal is for the system.events rows to be replicated in every node's logs in case one or more of the nodes are down.

This doesn't really fall under any team's ownership, but I'm taking Bulk I/O because jobs are one of the more frequent creators of events.

@dt (Member) commented Mar 3, 2020

Interesting. We'd essentially need to... poll the eventlog table and log changes?

I was thinking about this from a slightly different angle recently, which is that when bulk jobs schedule workers (distsql flow procs), those workers should log at startup and defer a log at exit, so you'd have an idea of why a given node was running a given piece of code.
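
As a rough sketch of that pattern (not actual crdb code; the function and field names are hypothetical, and the standard library logger stands in for crdb's log package):

```go
package main

import (
	"context"
	"log"
)

// runWorker stands in for a distsql flow processor scheduled by a bulk job.
// It logs why this node is running this piece of work at startup, and defers
// a matching line so the exit is visible in the log as well.
func runWorker(ctx context.Context, jobID int64, flowID string) error {
	log.Printf("starting bulk worker for job %d (flow %s)", jobID, flowID)
	defer log.Printf("bulk worker for job %d (flow %s) exiting", jobID, flowID)

	// ... the actual processor work would happen here ...
	_ = ctx
	return nil
}

func main() {
	_ = runWorker(context.Background(), 42, "flow-1")
}
```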

@petermattis (Collaborator, Author)

> Interesting. We'd essentially need to... poll the eventlog table and log changes?

That's the obvious implementation. Perhaps we can do something more clever. I'm sure @ajwerner would suggest something with range feeds.

@ajwerner (Contributor) commented Mar 3, 2020

> I'm sure @ajwerner would suggest something with range feeds.

Indeed I would.

@knz (Contributor) commented Mar 9, 2020

Can we also consider a network log sink - have our logs push to a network service.

@knz added the A-logging and C-enhancement labels Mar 9, 2020
@petermattis (Collaborator, Author)

> Can we also consider a network log sink - have our logs push to a network service.

Definitely. Is there a standard one to use?

@knz (Contributor) commented Mar 9, 2020

There are a couple actually. Aaron and I were talking about making that part of the security roadmap, since there's a "problems to solve" section already for this kind of work.

@knz (Contributor) commented Mar 9, 2020

(My technical proposal would be to start an experiment using syslog - which has its own standard protocol and distributed network sinks as plug-ins - and see where that brings us.)
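
For illustration, a minimal experiment along those lines using Go's standard log/syslog package to push entries to a remote syslog sink (the address, tag, and message below are placeholders, not the crdb integration):

```go
package main

import (
	"log"
	"log/syslog"
)

func main() {
	// Dial a remote syslog daemon over TCP; the address is a placeholder.
	w, err := syslog.Dial("tcp", "syslog.example.com:514",
		syslog.LOG_INFO|syslog.LOG_DAEMON, "cockroach")
	if err != nil {
		log.Fatalf("cannot reach syslog sink: %v", err)
	}
	defer w.Close()

	// Any cluster event we want operators to see could be forwarded this way.
	if err := w.Info("example cluster event"); err != nil {
		log.Printf("failed to forward event: %v", err)
	}
}
```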

@miretskiy self-assigned this Mar 12, 2020
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 16, 2020
Fixes cockroachdb#45643

The Cockroach server logs important system events into the eventlog table.
These events are exposed in the web UI.  However, operators often
want to see those global events while tailing a log file on a single
node.

Implement a mechanism for the server running on each node
to emit those system events into its server log file.

Release notes (feature): Log system-wide events into the cockroach.log
file on every node.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 23, 2020
Fixes cockroachdb#45643

The Cockroach server logs important system events into the eventlog table.
These events are exposed in the web UI.  However, operators often
want to see those global events while tailing a log file on a single
node.

Implement a mechanism for the server running on each node
to emit those system events into its server log file.

If system log scanning is enabled (via the server.eventlogsink.enabled setting),
each node scans the system event log table periodically,
once every server.eventlogsink.period.

For example, below is a single system event emitted to the regular log file:
  I200323 .... [n1] system.eventlog:n=1:'set_cluster_setting':2020-03-23 19:24:29.948279
    +0000 UTC '{"SettingName":"server.eventlogsink.max_entries","Value":"101","User":"root"}'

There is no guarantee that all events from the system log will be emitted.
In particular, upon node restart, we only emit events that were generated
from that point on.  Also, if for whatever reason we start emitting
too many system log messages, only the most recent
server.eventlogsink.max_entries (default 100) events will be emitted.
However, if we think we have "dropped" some events due to configuration settings,
we will indicate so in the log.

Release notes (feature): Log system-wide events into the cockroach.log
file on every node.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 1, 2020
Fixes cockroachdb#45643

The Cockroach server logs important system events into the eventlog table.
These events are exposed in the web UI.  However, operators often
want to see those global events while tailing a log file on a single
node.

Implement a mechanism for the server running on each node
to emit those system events into its server log file.

If system log scanning is enabled (via the server.eventlogsink.enabled setting),
each node scans the system event log table periodically,
once every server.eventlogsink.period.

For example, below is a single system event emitted to the regular log file:
  I200323 .... [n1] system.eventlog:n=1:'set_cluster_setting':2020-03-23 19:24:29.948279
    +0000 UTC '{"SettingName":"server.eventlogsink.max_entries","Value":"101","User":"root"}'

There is no guarantee that all events from the system log will be emitted.
In particular, upon node restart, we only emit events that were generated
from that point on.  Also, if for whatever reason we start emitting
too many system log messages, only the most recent
server.eventlogsink.max_entries (default 100) events will be emitted.
If we think we have "dropped" some events due to configuration settings,
we will indicate so in the log.

Administrators may choose to restrict the set of events emitted
by changing the server.eventlogsink.include_events and/or
server.eventlogsink.exclude_events settings.  These settings specify
regular expressions used to include or exclude events with matching event
types.

Release notes (feature): Log system-wide events into the cockroach.log
file on every node.

This feature allows an administrator logged in to one of the
nodes to monitor that node's log file and see important "system" events,
such as table/index creation, schema change jobs, etc.

To use this feature, the server.eventlogsink.enabled setting needs
to be set to true.
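
To make the mechanism described in that commit message concrete, here is a rough sketch of such a periodic scan written against database/sql rather than crdb internals. The setting names mirror the message above; the function, connection string, and exact column list are assumptions for illustration only:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// scanEventLog polls system.eventlog on a fixed period and writes any new
// rows to the local log. period and maxEntries stand in for the
// server.eventlogsink.period and server.eventlogsink.max_entries settings.
func scanEventLog(ctx context.Context, db *sql.DB, period time.Duration, maxEntries int) {
	lastSeen := time.Now() // only emit events generated from startup onward
	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		rows, err := db.QueryContext(ctx,
			`SELECT timestamp, "eventType", COALESCE(info, '')
			   FROM system.eventlog
			  WHERE timestamp > $1
			  ORDER BY timestamp
			  LIMIT $2`, lastSeen, maxEntries+1)
		if err != nil {
			log.Printf("eventlog scan failed: %v", err)
			continue
		}

		n := 0
		for rows.Next() {
			var ts time.Time
			var eventType, info string
			if err := rows.Scan(&ts, &eventType, &info); err != nil {
				break
			}
			n++
			if n > maxEntries {
				// Too many events arrived in one period: say so in the log
				// instead of silently dropping them.
				log.Printf("eventlog: additional events dropped this scan")
				break
			}
			log.Printf("system.eventlog: %s at %s: %s", eventType, ts, info)
			lastSeen = ts
		}
		rows.Close()
	}
}

func main() {
	// Placeholder connection string for a local single-node cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	scanEventLog(context.Background(), db, 10*time.Second, 100)
}
```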
@knz (Contributor) commented Apr 1, 2020

I have discussed this with @petermattis today.
For context:

  • the customer that motivated filing this issue does not consider this urgent. We can schedule it for 20.2.

  • the original user journey is "an operator wants CLI access to a stream of cluster events for manual and automatic monitoring".

  • it's not sufficient to provide "network logging" (i.e. logging to a network sink) as a feature; there will need to be a way for an operator looking at one node to see events coming from the entire cluster.

  • double-reporting is OK as long as there's a timestamp or ID that can be used to de-duplicate.

  • under-reporting is a problem. The customer can tolerate double-reporting but not missed events.

  • a changefeed-based solution, or a solution that can capitalize on one of the mechanisms already present in crdb, is preferable to the introduction of a new ad-hoc mechanism.

  • an intermediate, ad-hoc solution with known deficiencies is not "palatable". In Peter's words: "I’d find it more palatable if it didn’t have deficiencies, but even then I wouldn’t push to get it in."
