Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/log,cli: new TELEMETRY channel #66427

Merged
merged 1 commit into from
Jul 22, 2021

Conversation

knz
Copy link
Contributor

@knz knz commented Jun 14, 2021

Supersedes #64218

Informs #63815.

See discussion in #63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

(see PRs #67809 and #67507 for example follow-ups)

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: --log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'

When configuring file output, the operator should be careful to apply a
separate maximum retention for the TELEMETRY channel from other file
outputs, as telemetry data can be verbose and outcrowd other logging
messages. For example:

--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY, max-group-size: 1MB}, ...}}

@knz knz requested review from maryliag and thtruo June 14, 2021 15:04
@knz knz requested a review from a team as a code owner June 14, 2021 15:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz requested a review from rauchenstein June 14, 2021 15:06
@knz
Copy link
Contributor Author

knz commented Jun 14, 2021

NB: this currently lacks a unit test for the first query event, pending review of the open questions by @thtruo

@thtruo
Copy link
Contributor

thtruo commented Jun 14, 2021

Thanks @knz. Some thoughts on the first question:

With the current patch, the "first query" event will be reported
for these administrative SQL queries. Is this desired?

^ That's good to know about the current patch. Reporting administrative SQL queries is not the desired signal we want to observe. Our goal is to understand user-initiated queries, and as such, we should report user initiated queries.

(A similar question exists when the user opens cockroach sql, but
does not type anything in. The cockroach sql command sends queries
in the background, and would trigger the "first query" event even
without user interaction. Are we OK with this?)

^ Similar to the previous answer - we want "first query" to correspond to user interaction.

cc @emily-horing if she has another thought in mind

@knz
Copy link
Contributor Author

knz commented Jun 14, 2021

@thtruo thanks for clarifying but that really reveals an area what we never looked at before.

We've never really considered this need to distinguish "user-initiated". From crdb's perspective, the queries issued by CC intrusion are "user-initiated". How would crdb know otherwise?

I'm not sure how to think about this. @jordanlewis do you have ideas? Or do you know who could help with this?

@knz knz force-pushed the 20210614-telemetry-channel branch from 36f603e to 955825c Compare June 14, 2021 18:44
@thtruo
Copy link
Contributor

thtruo commented Jun 14, 2021

Thoughts regarding the second question:

To retrieve the “time to first query” (i.e. not “time of
first query”), the consumer of these events must then compute the
minimum timestamp of all the first query events received, and
subtract the timestamp of cluster creation. (The cluster creation
timestamp can be obtained either via an external source, e.g. when
the customer initiated the orchestration to create the cluster, or
by retrieving the first node_restart event from the OPS logging
channel.) Is this behavior acceptable?

^ That general approach to subtract cluster creation timestamp from min(first query timestamp) downstream sounds reasonable. FWIW we already have a table that tracks when a cloud cluster is created, so it appears that we already have a source for that information.

cc @vilamurugu for awareness on this approach

@knz
Copy link
Contributor Author

knz commented Jul 12, 2021

I'm going to split this PR into two: one to add the telemetry channel (which @THardy98 needs)
and another one for the first query.

@knz
Copy link
Contributor Author

knz commented Jul 20, 2021

I have split the work on "first query" telemetry into a separate PR: #67809

This way, this current PR only adds a new logging channel, and we can merge this to unblock progress on @THardy98 's #67507.

@knz
Copy link
Contributor Author

knz commented Jul 20, 2021

@rauchenstein @THardy98 can you review this PR? Thanks

@knz knz removed the X-noremind Bots won't notify about PRs with X-noremind label Jul 20, 2021
@knz knz force-pushed the 20210614-telemetry-channel branch from 82663f9 to 21d9f59 Compare July 20, 2021 17:06
@knz knz changed the title util/log,cli: new TELEMETRY channel and "first query" telemetry event util/log,cli: new TELEMETRY channel Jul 20, 2021
@knz knz force-pushed the 20210614-telemetry-channel branch from 21d9f59 to cdc43cb Compare July 20, 2021 17:16
Copy link

@THardy98 THardy98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though my review shouldn't be conclusive.

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @rauchenstein, and @thtruo)


pkg/cli/log_flags.go, line 449 at r4 (raw file):

		if ch == channel.TELEMETRY {
			// Keep less data for telemetry.

how was those values chosen? Should they be configurable?

@knz knz force-pushed the 20210614-telemetry-channel branch 2 times, most recently from 0cddb83 to 0328b1a Compare July 22, 2021 11:59
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag, @rauchenstein, and @thtruo)


pkg/cli/log_flags.go, line 449 at r4 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

how was those values chosen? Should they be configurable?

They are - like all the file groups, you're looking only at the default configuration here. Any customization will override this.

I have updated the release note in the commit message to explain this. Can you have a look?

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting in two PRs.
:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rauchenstein and @thtruo)

See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
separate maximum retention for the TELEMETRY channel from other file
outputs, as telemetry data can be verbose and outcrowd other logging
messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}`
@knz knz force-pushed the 20210614-telemetry-channel branch from 0328b1a to dfc7f2b Compare July 22, 2021 13:43
@knz
Copy link
Contributor Author

knz commented Jul 22, 2021

TFYR

bors r=maryliag

@craig
Copy link
Contributor

craig bot commented Jul 22, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants