
bulkio: provide visibility for events in debug logs #45643

Closed
petermattis opened this issue Mar 3, 2020 · 8 comments · Fixed by #57990
Labels: A-logging (In and around the logging infrastructure), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), T-disaster-recovery

Comments

@petermattis (Collaborator)

Customer request: provide visibility for cluster level events in the logs. We're already recording these cluster level events in the system.events table, and exposing them in the UI, but the customer wants to tail logs. Their ideal is for the system.events rows to be replicated in every node's logs in case one or more of the nodes are down.

This doesn't really fall under any team's ownership, but I'm taking Bulk I/O because jobs are one of the more frequent creators of events.

@dt (Member) commented Mar 3, 2020

Interesting. We'd essentially need to... poll the eventlog table and log changes?

I was thinking about this from a slightly different angle recently, which is that when bulk jobs schedule workers (distsql flow procs), those workers should log at startup and defer a log at exit, so you'd have an idea of why a given node was running a given piece of code.
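
As a rough sketch of that pattern (not actual crdb code; the function and field names are hypothetical, and the standard library logger stands in for crdb's log package):

```go
package main

import (
	"context"
	"log"
)

// runWorker stands in for a distsql flow processor scheduled by a bulk job.
// It logs why this node is running this piece of work at startup, and defers
// a matching line so the exit is visible in the log as well.
func runWorker(ctx context.Context, jobID int64, flowID string) error {
	log.Printf("starting bulk worker for job %d (flow %s)", jobID, flowID)
	defer log.Printf("bulk worker for job %d (flow %s) exiting", jobID, flowID)

	// ... the actual processor work would happen here ...
	_ = ctx
	return nil
}

func main() {
	_ = runWorker(context.Background(), 42, "flow-1")
}
```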

@petermattis (Collaborator, Author)

> Interesting. We'd essentially need to... poll the eventlog table and log changes?

That's the obvious implementation. Perhaps we can do something more clever. I'm sure @ajwerner would suggest something with range feeds.

@ajwerner (Contributor) commented Mar 3, 2020

> I'm sure @ajwerner would suggest something with range feeds.

Indeed I would.

@knz (Contributor) commented Mar 9, 2020

Can we also consider a network log sink - have our logs push to a network service.

@knz added the A-logging and C-enhancement labels Mar 9, 2020
@petermattis (Collaborator, Author)

> Can we also consider a network log sink - have our logs push to a network service.

Definitely. Is there a standard one to use?

@knz (Contributor) commented Mar 9, 2020

There are a couple actually. Aaron and I were talking about making that part of the security roadmap, since there's a "problems to solve" section already for this kind of work.

@knz (Contributor) commented Mar 9, 2020

(My technical proposal would be to start an experiment using syslog - which has its own standard protocol and distributed network sinks as plug-ins - and see where that brings us.)
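
For illustration, a minimal experiment along those lines using Go's standard log/syslog package to push entries to a remote syslog sink (the address, tag, and message below are placeholders, not the crdb integration):

```go
package main

import (
	"log"
	"log/syslog"
)

func main() {
	// Dial a remote syslog daemon over TCP; the address is a placeholder.
	w, err := syslog.Dial("tcp", "syslog.example.com:514",
		syslog.LOG_INFO|syslog.LOG_DAEMON, "cockroach")
	if err != nil {
		log.Fatalf("cannot reach syslog sink: %v", err)
	}
	defer w.Close()

	// Any cluster event we want operators to see could be forwarded this way.
	if err := w.Info("example cluster event"); err != nil {
		log.Printf("failed to forward event: %v", err)
	}
}
```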

@miretskiy self-assigned this Mar 12, 2020
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 16, 2020
Fixes cockroachdb#45643

The Cockroach server logs important system events into the eventlog table.
These events are exposed in the web UI.  However, operators often
want to see those global events while tailing a log file on a single
node.

Implement a mechanism for the server running on each node
to emit those system events into its server log file.

Release notes (feature): Log system-wide events into the cockroach.log
file on every node.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Mar 23, 2020
Fixes cockroachdb#45643

The Cockroach server logs important system events into the eventlog table.
These events are exposed in the web UI.  However, operators often
want to see those global events while tailing a log file on a single
node.

Implement a mechanism for the server running on each node
to emit those system events into its server log file.

If system log scanning is enabled (via the server.eventlogsink.enabled setting),
each node scans the system event log table periodically,
once every server.eventlogsink.period.

For example, below is a single system event emitted to the regular log file:
  I200323 .... [n1] system.eventlog:n=1:'set_cluster_setting':2020-03-23 19:24:29.948279
    +0000 UTC '{"SettingName":"server.eventlogsink.max_entries","Value":"101","User":"root"}'

There is no guarantee that all events from the system log will be emitted.
In particular, upon node restart, we only emit events that were generated
from that point on.  Also, if for whatever reason we start emitting
too many system log messages, only the most recent
server.eventlogsink.max_entries (default 100) events will be emitted.
However, if we think we have "dropped" some events due to configuration settings,
we will indicate so in the log.

Release notes (feature): Log system-wide events into the cockroach.log
file on every node.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Apr 1, 2020
Fixes cockroachdb#45643

The Cockroach server logs important system events into the eventlog table.
These events are exposed in the web UI.  However, operators often
want to see those global events while tailing a log file on a single
node.

Implement a mechanism for the server running on each node
to emit those system events into its server log file.

If system log scanning is enabled (via the server.eventlogsink.enabled setting),
each node scans the system event log table periodically,
once every server.eventlogsink.period.

For example, below is a single system event emitted to the regular log file:
  I200323 .... [n1] system.eventlog:n=1:'set_cluster_setting':2020-03-23 19:24:29.948279
    +0000 UTC '{"SettingName":"server.eventlogsink.max_entries","Value":"101","User":"root"}'

There is no guarantee that all events from the system log will be emitted.
In particular, upon node restart, we only emit events that were generated
from that point on.  Also, if for whatever reason we start emitting
too many system log messages, only the most recent
server.eventlogsink.max_entries (default 100) events will be emitted.
If we think we have "dropped" some events due to configuration settings,
we will indicate so in the log.

Administrators may choose to restrict the set of events emitted
by changing the server.eventlogsink.include_events and/or
server.eventlogsink.exclude_events settings.  These settings specify
regular expressions used to include or exclude events with matching event
types.

Release notes (feature): Log system-wide events into the cockroach.log
file on every node.

This feature allows an administrator logged in to one of the
nodes to monitor that node's log file and see important "system" events,
such as table/index creation, schema change jobs, etc.

To use this feature, the server.eventlogsink.enabled setting needs
to be set to true.
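
To make the mechanism described in that commit message concrete, here is a rough sketch of such a periodic scan written against database/sql rather than crdb internals. The setting names mirror the message above; the function, connection string, and exact column list are assumptions for illustration only:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// scanEventLog polls system.eventlog on a fixed period and writes any new
// rows to the local log. period and maxEntries stand in for the
// server.eventlogsink.period and server.eventlogsink.max_entries settings.
func scanEventLog(ctx context.Context, db *sql.DB, period time.Duration, maxEntries int) {
	lastSeen := time.Now() // only emit events generated from startup onward
	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}

		rows, err := db.QueryContext(ctx,
			`SELECT timestamp, "eventType", COALESCE(info, '')
			   FROM system.eventlog
			  WHERE timestamp > $1
			  ORDER BY timestamp
			  LIMIT $2`, lastSeen, maxEntries+1)
		if err != nil {
			log.Printf("eventlog scan failed: %v", err)
			continue
		}

		n := 0
		for rows.Next() {
			var ts time.Time
			var eventType, info string
			if err := rows.Scan(&ts, &eventType, &info); err != nil {
				break
			}
			n++
			if n > maxEntries {
				// Too many events arrived in one period: say so in the log
				// instead of silently dropping them.
				log.Printf("eventlog: additional events dropped this scan")
				break
			}
			log.Printf("system.eventlog: %s at %s: %s", eventType, ts, info)
			lastSeen = ts
		}
		rows.Close()
	}
}

func main() {
	// Placeholder connection string for a local single-node cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	scanEventLog(context.Background(), db, 10*time.Second, 100)
}
```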
@knz (Contributor) commented Apr 1, 2020

I have discussed this with @petermattis today.
For context:

  • the customer that motivated filing this issue does not consider this urgent. We can schedule it for 20.2.

  • the original user journey is "an operator wants CLI access to a stream of cluster events for manual and automatic monitoring".

  • it's not sufficient to provide "network logging" (i.e. logging to a network sink) as a feature; there will need to be a way for an operator looking at one node to see events coming from the entire cluster.

  • double-reporting is OK as long as there's a timestamp or ID that can be used to de-duplicate.

  • under-reporting is a problem. The customer can tolerate double-reporting but not missed events.

  • a changefeed-based solution, or a solution that can capitalize on one of the mechanisms already present in crdb, is preferable to the introduction of a new ad-hoc mechanism.

  • an intermediate, ad-hoc solution with known deficiencies is not "palatable". In Peter's words: "I’d find it more palatable if it didn’t have deficiencies, but even then I wouldn’t push to get it in."
