Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,log: productionize the event logging #57737

Merged
merged 1 commit into from
Dec 11, 2020

Conversation

knz
Copy link
Contributor

@knz knz commented Dec 9, 2020

Informs #57629

CockroachDB needs to report structured events for
"important" (notable) events on the logical structure of a cluster,
including changes to the SQL logical schema, node activity, privilege
changes etc.

Prior to this patch, these events were reported mainly into the
table system.eventlog, with a partial copy of the payload into the
debugging external log (cockroach.log).

This solution was incomplete and unsatisfactory in many ways:

  • the event payloads were not documented.

  • the event payloads were not centrally defined, which was preventing
    the generation of automatic documentation.

  • the payload types were declared "inline" in the log calls, which
    made it easy for team members to inadvertently change the structure of
    the payload and make them backward-incompatible for users consuming
    this data externally.

  • the payload fields were inconsistently named across event types.

  • the metadata fields on the payloads were incompletely and
    inconsistently populated:

    • the SQL instance ID was missing in some cases.
    • the descriptor ID of affected descriptor was missing in some
      cases.
  • the same event type was mistakenly used for different events (e.g.
    "rename_database" for both RENAME DATABASE and CONVERT TO SCHEMA)

  • the same event type was abusingly over-used for multiple separate
    operations, e.g. a single event would be generated for a multi-table,
    multi-user GRANT or REVOKE operation.

  • events could be omitted entirely from external logs, e.g.
    if the system.eventlog table was not available during a node
    restart.

  • the copy in the external log was not parseable. Generally, the
    logging package was unaware of the internal structure of events
    and would “flatten” them.

  • no provision was available to partially redact events. From the
    logging system's perspective, the entire payload is sensitive.

This commit changes the situation as follows:

  • it centralizes the payload definitions and standardizes them into a
    new package eventspb.
  • it groups events into categories and associates logging
    channels to each event category.
  • it enables automatic generation of documentation for events.
  • it ensures that field names are consistent across event payloads.
  • it ensures that event metadata is consistently populted.
  • it decomposes complex GRANT/REVOKE operations into individual
    events.
  • it automates the generation of a reference documentation for all
    event types.
  • it ensures that events are sent to external logs unconditionally.

The following problems continue to exist and need to be resolved
separately:

Release note (sql change): The cluster event logging system has been
modernized. In particular, the schema of the entries for the info
column in system.eventlog has been stabilized.

Release note (cli change): The events that were previously only stored
in system.eventlog are now also directed unconditionally to an
external logging channel using a JSON format. Refer to the
configuration to see how to customize how events are directed to
external sinks. Note that the exact external output format (and thus
how to detect/parse the events from e.g. log files) is not yet
stabilized and remains subject to change.

Release note (doc change): The cluster event logging system has been
standardized. A reference documentation is now
available (auto-generated from source code); changes to non-reserved
payloads will now be announced at least one release version in
advance. The event types are organized into broad categories: SQL
Logical Schema Changes, SQL Privilege Changes, SQL User Management,
CLuster-level events and SQL Miscellaneous operations.

Release note (backward-incompatible change): The payload fields for
certain event types in system.eventlog have been changed and/or
renamed. Note that the payloads in system.eventlog were an
undocumented, reserved feature so no guarantee was made about
cross-version compatibility to this point. The list of changes
includes (but is not limited to):

  • TargetID has been renamed to NodeID for node_join
  • TargetID has been renamed to TargetNodeID for
    node_decommissioning / node_decommissioned /
    node_recommissioned.
  • NewDatabaseName has been renamed to NewDatabaseParent for
    convert_to_schema.
  • grant_privilege and revoke_privilege have been removed;
    they are replaced by change_database_privilege,
    change_schema_privilege, change_type_privilege and
    change_table_privilege.
    Each event only reports change for one user/role, so the
    Grantees field was renamed to Grantee.
  • Each drop_role event now pertains to a single user/role.

Release note (sql change): The targetID and reportingID columns in
system.eventlog are now deprecated. Their values, for relevant event
types, can be found as fields inside the info column instead.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Dec 9, 2020

@itsbilal this is the draft branch we talked about yesterday.
Can you have a look at the overall approach?

What I am missing here is an approach that works for the 3 odd events, those that are node-level: node restart, join, decommission. How to best factor that code?

Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me so far! I like the CommonDetails/CommonSQLDetails etc setup of the protobuf structs. I made one suggestion around handling the node events by having a more specialized method than InsertEventRecord that fills in node-specific details. From my code observation so far, that seems to make sense to be, but let me know if it doesn't.

Reviewed 16 of 16 files at r1, 21 of 21 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/sql/event_log.go, line 233 at r2 (raw file):

// InsertEventRecord inserts a single event into the event log as part of the
// provided transaction.
func (ev EventLogger) InsertEventRecord(

I'm guessing all uses of this method will soon be from just this file. Might be worth unexporting the method and having a separate one, LogNodeEvent, that takes in a partially-filled NodeEvent, a node descriptor/ctx/etc, and fills in common details + common node details?

@knz knz force-pushed the 20201204-eventlog branch from 55e0735 to 95665be Compare December 10, 2020 00:42
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is good for a first round of review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)


pkg/sql/event_log.go, line 233 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I'm guessing all uses of this method will soon be from just this file. Might be worth unexporting the method and having a separate one, LogNodeEvent, that takes in a partially-filled NodeEvent, a node descriptor/ctx/etc, and fills in common details + common node details?

Done.

@knz knz marked this pull request as ready for review December 10, 2020 00:56
@knz knz requested a review from a team December 10, 2020 00:56
@knz knz requested a review from a team as a code owner December 10, 2020 00:56
@knz knz force-pushed the 20201204-eventlog branch 4 times, most recently from 279e1d0 to d4186bc Compare December 10, 2020 14:59
Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far! Just some minor comments. Looking forward to seeing the rest of the changes come in.

Reviewed 45 of 45 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/sql/event_log.go, line 150 at r3 (raw file):

`
	args := []interface{}{
		info.CommonDetails().Timestamp,

Might be a good idea returning an error here if this is unset?


pkg/util/log/eventpb/events.proto, line 21 at r3 (raw file):

//
// Notes:
// - the fields have jsontag ",..." to override the default

,omitempty intead of ... ? I'm not sure if I understand this comment as-is.

@knz knz force-pushed the 20201204-eventlog branch from d4186bc to 3ece831 Compare December 11, 2020 09:36
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @itsbilal)


pkg/sql/event_log.go, line 150 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Might be a good idea returning an error here if this is unset?

Good idea. Done.


pkg/util/log/eventpb/events.proto, line 21 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

,omitempty intead of ... ? I'm not sure if I understand this comment as-is.

Explained in comment.

CockroachDB needs to report structured events for
"important" (notable) events on the logical structure of a cluster,
including changes to the SQL logical schema, node activity, privilege
changes etc.

Prior to this patch, these events were reported mainly into the
table `system.eventlog`, with a partial copy of the payload into the
debugging external log (`cockroach.log`).

This solution was incomplete and unsatisfactory in many ways:

- the event payloads were not documented.
- the event payloads were not centrally defined, which was preventing
  the generation of automatic documentation.
- the payload types were declared "inline" in the log calls, which
  made it easy for team members to inadvertently change the structure of
  the payload and make them backward-incompatible for users consuming
  this data externally.
- the payload fields were inconsistently named across event types.
- the metadata fields on the payloads were incompletely and
  inconsistently populated:

  - the SQL instance ID was missing in some cases.
  - the descriptor ID of affected descriptor was missing in some
    cases.

- the same event type was mistakenly used for different events (e.g.
  "rename_database" for both RENAME DATABASE and CONVERT TO SCHEMA)
- the same event type was abusingly over-used for multiple separate
  operations, e.g. a single event would be generated for a multi-table,
  multi-user GRANT or REVOKE operation.
- events could be omitted entirely from external logs, e.g.
  if the system.eventlog table was not available during a node
  restart.
- the copy in the external log was not parseable. Generally, the
  logging package was unaware of the internal structure of events
  and would “flatten” them.
- no provision was available to partially redact events. From the
  logging system's perspective, the entire payload is sensitive.

This commit changes the situation as follows:

- it centralizes the payload definitions and standardizes them into a
  new package `eventspb`.
- it groups events into categories and associates logging
  channels to each event category.
- it enables automatic generation of documentation for events.
- it ensures that field names are consistent across event payloads.
- it ensures that event metadata is consistently populted.
- it decomposes complex GRANT/REVOKE operations into individual
  events.
- it automates the generation of a reference documentation for all
  event types.
- it ensures that events are sent to external logs unconditionally.

The following problems continue to exist and need to be resolved
separately:

- privilege changes that occur as a side effect of certain operations
  do not get events logged:

  cockroachdb#57573
  cockroachdb#57576
  cockroachdb#57739
  cockroachdb#57741

- the name fields in certain DDL events is not properly qualified,
  which prevents the determination of the logical schema or database
  where the object was altered:

  cockroachdb#57734
  cockroachdb#57735
  cockroachdb#57738
  cockroachdb#57740

Release note (sql change): The cluster event logging system has been
modernized. In particular, the schema of the entries for the `info`
column in `system.eventlog` has been stabilized.

Release note (cli change): The events that were previously only stored
in `system.eventlog` are now also directed unconditionally to an
external logging channel using a JSON format.  Refer to the
configuration to see how to customize how events are directed to
external sinks. Note that the exact external output format (and thus
how to detect/parse the events from e.g. log files) is not yet
stabilized and remains subject to change.

Release note (doc change): The cluster event logging system has been
standardized. A reference documentation is now
available (auto-generated from source code); changes to non-reserved
payloads will now be announced at least one release version in
advance. The event types are organized into broad categories: SQL
Logical Schema Changes, SQL Privilege Changes, SQL User Management,
CLuster-level events and SQL Miscellaneous operations.

Release note (backward-incompatible change): The payload fields for
certain event types in `system.eventlog` have been changed and/or
renamed. Note that the payloads in `system.eventlog` were an
undocumented, reserved feature so no guarantee was made about
cross-version compatibility to this point. The list of changes
includes (but is not limited to):

- `TargetID` has been renamed to `NodeID` for `node_join`
- `TargetID` has been renamed to `TargetNodeID` for
  `node_decommissioning` / `node_decommissioned` /
  `node_recommissioned`.
- `NewDatabaseName` has been renamed to `NewDatabaseParent` for
  `convert_to_schema`.
- `grant_privilege` and `revoke_privilege` have been removed;
  they are replaced by `change_database_privilege`,
  `change_schema_privilege`, `change_type_privilege` and
  `change_table_privilege`.
  Each event only reports change for one user/role, so the
  `Grantees` field was renamed to `Grantee`.
- Each `drop_role` event now pertains to a single user/role.

Release note (sql change): The `targetID` and `reportingID` columns in
`system.eventlog` are now deprecated. Their values, for relevant event
types, can be found as fields inside the `info` column instead.
@knz knz force-pushed the 20201204-eventlog branch from a708667 to b6f03db Compare December 11, 2020 18:15
@knz
Copy link
Contributor Author

knz commented Dec 11, 2020

bors r=itsbilal

@craig
Copy link
Contributor

craig bot commented Dec 11, 2020

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants