Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] util/log,diagnostics: new TELEMETRY channel, send reports to logging #64218

Closed
wants to merge 1 commit into from

Conversation

knz
Copy link
Contributor

@knz knz commented Apr 26, 2021

Informs #63815.

See discussion in #63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase:

  • creating a new logging channel TELEMETRY
  • ensuring the channel is not redirected to files by default.
  • copies the diagnostic report to the TELEMETRY channel,
    in addition of sending it to register.cockroachdb.com
    as previously.

Later work will aim to further this project by subsuming
the HTTP reporting to register.cockroachdb.com entirely
under logging, e.g. via a HTTP sink (see issue #63980).

Release note (cli change): CockroachDB now also sends
telemetry diagnostics to the new logging channel TELEMETRY.
By default, this channel is not connected to file output,
so there is no observable difference in behavior
for default deployments. To redirect the diagnostic
output, one can define a new sink that captures
this channel.

For example, to see diagnostics reports on the standard
error, one can use: --log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'

@knz knz requested review from thtruo, logston and vilamurugu April 26, 2021 15:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz knz force-pushed the 20210426-telemetry-logging branch from 83c6ea4 to 18e7afc Compare April 26, 2021 15:50
Copy link
Contributor

@logston logston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have the CRDB context give this a thumbs up or not but I like what I see! Thanks for this work @knz.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @thtruo and @vilamurugu)

See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase:

- creating a new logging channel TELEMETRY
- ensuring the channel is not redirected to files by default.
- copies the diagnostic report to the TELEMETRY channel,
  *in addition* of sending it to `register.cockroachdb.com`
  as previously.

Later work will aim to further this project by subsuming
the HTTP reporting to `register.cockroachdb.com` entirely
under logging, e.g. via a HTTP sink (see issue cockroachdb#63980).

Release note (cli change): CockroachDB now also sends
telemetry diagnostics to the new logging channel TELEMETRY.
By default, this channel is not connected to file output,
so there is no observable difference in behavior
for default deployments. To redirect the diagnostic
output, one can define a new sink that captures
this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`
@knz knz force-pushed the 20210426-telemetry-logging branch from 18e7afc to e934980 Compare May 3, 2021 15:40
@thtruo
Copy link
Contributor

thtruo commented May 27, 2021

cc @JuanLeon1

Copy link
Contributor

@thtruo thtruo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, Raphael! I like the inclusion of having diagnostic usage reports being sent as distinct logs through the TELEMETRY channel.

Reviewed 22 of 22 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @vilamurugu)

@thtruo
Copy link
Contributor

thtruo commented May 28, 2021

@knz One question came up in a conversation with @logston around the Serverless case. Will diagnostic usage reports be sent per tenant or per node?
If they are per tenant, we may want to consider doing some sampling to reduce the volume of data (and cost) going through telemetry into the storage bucket

@knz
Copy link
Contributor Author

knz commented May 28, 2021

Will diagnostic usage reports be sent per tenant or per node?

We want both, for different purposes:

  • To analyze our customer's behavior PM wants data from SQL tenants, not the KV nodes underneath. So the most data we'll collect will be per tenant.
  • Separately, I think we want KV nodes to also telemetry that reflects CRDB usage by the CC control plane, because we'll want to utilize this telemetry in crdb engineering to optimize CC's performance overall (especially in the KV team)

If they are per tenant, we may want to consider doing some sampling to reduce the volume of data (and cost) going through telemetry into the storage bucket

You're preaching to the choir. That's exactly why we absolutely must use sampling for query logging.

The other events of interest (feature counters, time to first query etc) are infrequent enough.

@thtruo
Copy link
Contributor

thtruo commented Jun 1, 2021

cc @andy-kimball for awareness as this is one of the prerequisite pieces towards enabling the logging telemetry pipeline to support product analytics for CC Serverless

@knz
Copy link
Contributor Author

knz commented Jun 15, 2021

superseded by #66427

@knz knz closed this Jun 15, 2021
craig bot pushed a commit that referenced this pull request Jul 22, 2021
66427: util/log,cli: new TELEMETRY channel r=maryliag a=knz

Supersedes #64218


Informs #63815.

See discussion in #63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

(see PRs #67809 and #67507 for example follow-ups)

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
separate maximum retention for the TELEMETRY channel from other file
outputs, as telemetry data can be verbose and outcrowd other logging
messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}`


67913: roachtest: remove mention of 'roachstress' r=tbg a=knz

The proposed 'roachstress' script linked from the repro steps at
https://gist.github.com/tbg/fc4409e06b4ddc9abc58838cf5c0fe8b is
utterly broken: it confuses local with remote execution.

The correct steps must clarify that `roachprod` and `roachtest`
binaries must be built for the local platform, and the `cockroach` and
`workload` binaries must be built for the remote platform. Someone
trying to run this on macos, or with a linux without the old glibc,
gets inscrutable error messages with the current script.

Additionally, the proposed roachstress script uses a `--cpu-quota`
parameter that's incoherent with `roachtest`'s own default.

Finally, I (@knz) directly witnessed a new hire who was utterly
confused by the amount of automation and assumptions made in the
roachtest script. (Should they install caffeinate? Why? How to
customize the roachtest flags? How to change the number of runs? Can
the test run locally  on my computer?)

In summary, the `roachstress` script is just too magical and actively
hostile to newcomers.

This patch removes it and replaces it by clearer instructiosn.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants