
telemetry: move the diagnostic protos to structured logging + submit payloads to reg server for backward-compat #63815

Open
knz opened this issue Apr 17, 2021 · 10 comments
Labels
A-logging In and around the logging infrastructure. A-observability-inf A-telemetry C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@knz
Contributor

knz commented Apr 17, 2021

This was discussed with @vilamurugu and @petermattis

Instead of pushing state changes to a central registration server, we could emit telemetry events to an append-only store (e.g. on S3) and mine the telemetry data using event aggregation.

This approach could be called “telemetry logging” to contrast it with the (current) “telemetry updates”.

How would this work?

@petermattis suggests leveraging the new infrastructure for structured logging. We could operate as follows:

  • define a new logging channel TELEMETRY.
  • create a new structured log event type in eventpb that embeds telemetry data (this could possibly be as simple as a new proto message type that embeds a field of the type defined in the diagnosticspb package)
  • change the telemetry update loop to use log.StructuredEvent instead of a custom HTTP PUT/GET client and request (see the sketch after this list).
  • for CockroachCloud customers, set up a logging config on each server that redirects TELEMETRY events to our shared log collector.
  • for on-prem customers and backward compatibility, implement a new logging sink for a HTTP collector (log: implement a HTTP sink #63980), and ensure that the default config redirects TELEMETRY events to our existing registration HTTP endpoint via the network logging subsystem
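
To make the middle steps concrete, here is a minimal, self-contained Go sketch of what the emission side could look like. The `TelemetryReport` type, its fields, and the `structuredEvent` helper are hypothetical stand-ins for the real eventpb/diagnosticspb definitions and for `log.StructuredEvent`; the actual types and signatures live in the crdb tree and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// TelemetryReport is a stand-in for a hypothetical eventpb event type
// that would embed the diagnosticspb payload.
type TelemetryReport struct {
	Timestamp int64           `json:"timestamp"`
	EventType string          `json:"event_type"`
	Report    json.RawMessage `json:"report"` // the serialized diagnostic report
}

// structuredEvent is a stand-in for log.StructuredEvent: it renders the
// event as JSON and hands it to whichever sinks the TELEMETRY channel
// is configured with (here, just stdout).
func structuredEvent(ev TelemetryReport) {
	b, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}

func main() {
	// Where the diagnostics update loop currently performs an HTTP request,
	// it would instead emit a structured event on the TELEMETRY channel.
	report := json.RawMessage(`{"node_id": 1, "uses_enterprise_features": false}`)
	structuredEvent(TelemetryReport{
		Timestamp: time.Now().UnixNano(),
		EventType: "diagnostic_report",
		Report:    report,
	})
}
```

The point is only that the diagnostics update loop hands a proto-backed payload to the logging subsystem and lets the configured TELEMETRY sinks decide where it goes.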

This plan depends on the existence of a log collector for CockroachCloud; however, we have already implemented such a collector for other security purposes, so we can extend its usage.

Once this plan is implemented, we would still receive the telemetry for on-prem users, albeit using HTTP requests issued by the logging system (instead of the ad-hoc HTTP client in our current diagnostics package).

Additionally, we would receive the telemetry events from CC clusters through the log collector, and these can be redirected to a CRL-internal S3 bucket via configuration at the log collector level.

Another benefit is that it would allow on-prem users with network restrictions to collect telemetry events using their own log collector, and deliver them to us manually upon request. (Currently telemetry from isolated on-prem clusters is simply lost.)

Epic: CC-4303

Jira issue: CRDB-6782

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-telemetry A-logging In and around the logging infrastructure. labels Apr 17, 2021
@knz
Contributor Author

knz commented Apr 17, 2021

Once we have a TELEMETRY log channel configured as described above, we'd have an alternate approach to "time to first query", simpler than that proposed in #63812:

  • the init node would log a "cluster initialized" event on the TELEMETRY channel upon cluster initialization.
  • every node would log a "first query executed" event on the TELEMETRY channel the first time it runs a query after starting up.
  • upon log aggregation, we'd simply scan the log event stream and compute the timestamp difference between the "cluster init" event and the first "first query exec" event (a sketch of this aggregation follows below).
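
To illustrate the aggregation step, here is a small self-contained Go sketch that computes the delta from a decoded event stream. The event names mirror the bullets above but are otherwise made up; the real job would run over the collected structured logs rather than an in-memory slice.

```go
package main

import (
	"fmt"
	"time"
)

// event is a stand-in for a decoded TELEMETRY log entry after collection.
type event struct {
	Type string    // "cluster_initialized" or "first_query_executed"
	At   time.Time // log timestamp
}

// timeToFirstQuery scans the aggregated event stream and returns the
// difference between the cluster-init event and the earliest
// first-query event across all nodes.
func timeToFirstQuery(events []event) (time.Duration, bool) {
	var init, firstQuery time.Time
	for _, ev := range events {
		switch ev.Type {
		case "cluster_initialized":
			init = ev.At
		case "first_query_executed":
			if firstQuery.IsZero() || ev.At.Before(firstQuery) {
				firstQuery = ev.At
			}
		}
	}
	if init.IsZero() || firstQuery.IsZero() {
		return 0, false // one of the events was lost or not yet emitted
	}
	return firstQuery.Sub(init), true
}

func main() {
	t0 := time.Date(2021, 4, 17, 12, 0, 0, 0, time.UTC)
	events := []event{
		{Type: "cluster_initialized", At: t0},
		{Type: "first_query_executed", At: t0.Add(95 * time.Second)},
		{Type: "first_query_executed", At: t0.Add(3 * time.Minute)},
	}
	if d, ok := timeToFirstQuery(events); ok {
		fmt.Println("time to first query:", d) // 1m35s
	}
}
```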

A drawback of this approach is that a failure in the network log collector could cause the "cluster init" event or some of the "first query" events to be lost.

@logston
Contributor

logston commented Apr 19, 2021

Looks great @knz!

I have a couple of notes/questions:

we could emit telemetry events to an append-only store (e.g. on S3) and mine the telemetry data using event aggregation. / for CockroachCloud customers, set up a logging config on each server that redirects TELEMETRY events to our shared log collector.

Like you mention, for CC this will be a small lift for 21.1+ clusters once we can direct TELEMETRY events via a logging channel configuration. 20.2 CC clusters will require a lot of care, but perhaps we can delay porting them until they've migrated to 21.1, obviating the need for that extra work.

for on-prem customers and backward compatibility, implement a new logging sink for a HTTP collector, and ensure that the default config redirects TELEMETRY events to our existing registration HTTP endpoint via the network logging subsystem

I see this being the case for some time to come. For on-prem, it will be tricky to ship logs to our global sink without a proxy, given the need for authn/authz when POSTing to the sink. We could provide on-prem customers with a suggested solution/config for relaying telemetry logs such that the isolated vs. non-isolated customers have a very similar setup. Isolated customers would simply have one less sink in their fluentbit/fluentd/logstash config.

and ensure that the default config redirects TELEMETRY events to our existing registration HTTP endpoint via the network logging subsystem

To "close the loop", we could have this endpoint proxy the events to our global sink such that all logs (both CC and on-prem) end up in the same location. Perhaps this was what you were alluding to already and I misunderstood.

Additionally, we would receive the telemetry events from CC clusters through the log collector, and these can be redirected to a CRL-internal S3 bucket via configuration at the log collector level.

Nit: Instead, can we dump these logs to a per-cluster bucket as well as the global sink? This would require minimal changes in CC.

@logston
Contributor

logston commented Apr 20, 2021

I see now that there is a more in depth doc that precedes this issue. Feel free to ignore my comments here in favor of my comments in the other document.

@knz
Contributor Author

knz commented Apr 20, 2021

I think the other doc is much more narrow in scope. The "log-based telemetry project", if we start it, will have more stakeholders and a wider impact.

Nit: Instead, can we dump these logs to a per-cluster bucket as well as the global sink? This would require minimal changes in CC.

Yes, we can do this. The issue at top is a general vision, and the decision of which events get routed where is a matter of configuration -- i.e. a decision that the consumers of the feature (the SREs and the data team) make together, separately from this issue.

knz added a commit to knz/cockroach that referenced this issue May 3, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase:

- creating a new logging channel TELEMETRY
- ensuring the channel is not redirected to files by default.
- copying the diagnostic report to the TELEMETRY channel,
  *in addition* to sending it to `register.cockroachdb.com`
  as previously.

Later work will aim to further this project by subsuming
the HTTP reporting to `register.cockroachdb.com` entirely
under logging, e.g. via an HTTP sink (see issue cockroachdb#63980).

Release note (cli change): CockroachDB now also sends
telemetry diagnostics to the new logging channel TELEMETRY.
By default, this channel is not connected to file output,
so there is no observable difference in behavior
for default deployments. To redirect the diagnostic
output, one can define a new sink that captures
this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`
@andy-kimball
Contributor

A downside to changing how telemetry works: we'll stop getting telemetry data for older CRDB versions (we still get lots of reports from versions that are years old), since older versions will only know how to log to the registry service, which presumably will be shut down. Or, if it's not shut down, we'll be maintaining two parallel pipelines, which means even more work and complication.

Also, the biggest problems I've seen with the reporting pipeline are:

  1. There is too much data, making it take tens of minutes to run even simple queries.
  2. The data is low quality, because there are many test/dev clusters operating, and so the Product team doesn't trust it and doesn't use it much.

Is it really the telemetry technology that is problematic (i.e. RPC call vs. logging-based)? We could switch to logging-based telemetry, and I think we'd still be in the same bad place, because that solution doesn't address either problem #1 or problem #2 above.

@knz
Contributor Author

knz commented May 7, 2021

Or, if it's not shut down, we'll be maintaining two parallel pipelines, which means even more work and complication.

The way it's being envisioned is this:

  • for on-prem users, we will continue HTTP uploads. The implementation thereof inside crdb will move from the diagnostics package to the util/log package via an HTTP sink (log: implement a HTTP sink #63980)
  • however, this does not mean we are going to maintain 2 pipelines; instead, the HTTP collector will be changed to append the data to S3 (or the storage bucket) rather than storing it inside a crdb cluster. This will unify the aggregation across on-prem and CC clusters (a rough sketch of such a collector follows below).
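
Roughly, the collector-side change could look like the sketch below, with the local filesystem standing in for the S3/object-store client; the `/report` path and `uuid` parameter are illustrative, not the actual contract of the registration endpoint.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// dataDir stands in for the S3 (or other storage) bucket; in the real
// collector this would be an object-store client, not the local filesystem.
const dataDir = "/tmp/telemetry-bucket"

// collect appends each incoming report as a new object keyed by cluster
// ID and arrival time, so on-prem (via register.cockroachdb.com) and CC
// (via the log collector) reports land in the same place.
func collect(w http.ResponseWriter, r *http.Request) {
	clusterID := r.URL.Query().Get("uuid")
	if clusterID == "" {
		clusterID = "unknown"
	}
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	key := fmt.Sprintf("%s/%d.json", clusterID, time.Now().UnixNano())
	path := filepath.Join(dataDir, key)
	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}
	if err := os.WriteFile(path, body, 0644); err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/report", collect)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```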

This way, as a technology end result:

  • inside crdb, the emission of telemetry is still implemented in 1 place
  • inside CRL, the data arrives in 1 place
  • we preserve compatibility with previous crdb versions by keeping register.cockroachdb.com available
  • the data for old versions can arrive in the S3 bucket so that PMs can see it

the biggest problems I've seen with the reporting pipeline are: [...]

The project discussed here is somewhat orthogonal to these two problems.

FYI:

  • the solution to your "problem 1" will be to cull old data from the collected telemetry.
  • the solution to "problem 2" is to filter the data upfront, either in register.cockroachdb.com or later in the pipeline, before it gets to the storage buckets.

I think it's going to be useful to call out "problem 2" during the architectural discussion next week.

@thtruo
Contributor

thtruo commented May 27, 2021

cc @JuanLeon1

knz added a commit to knz/cockroach that referenced this issue Jun 14, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. By default, this channel is connected to file output, with
a maximum retention of 1MiB. To redirect the diagnostic output, one
can define a new sink that captures this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

(At the time of this writing, no events are defined for the TELEMETRY
channel yet.)
knz added a commit to knz/cockroach that referenced this issue Jul 20, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. By default, this channel is connected to file output, with
a maximum retention of 1MiB. To redirect the diagnostic output, one
can define a new sink that captures this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

(At the time of this writing, no events are defined for the TELEMETRY
channel yet.)
knz added a commit to knz/cockroach that referenced this issue Jul 22, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
maximum retention for the TELEMETRY channel separate from that of other
file outputs, as telemetry data can be verbose and crowd out other
logging messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}'`
craig bot pushed a commit that referenced this issue Jul 22, 2021
66427: util/log,cli: new TELEMETRY channel r=maryliag a=knz

Supersedes #64218


Informs #63815.

See discussion in #63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

(see PRs #67809 and #67507 for example follow-ups)

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
maximum retention for the TELEMETRY channel separate from that of other
file outputs, as telemetry data can be verbose and crowd out other
logging messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}'`


67913: roachtest: remove mention of 'roachstress' r=tbg a=knz

The proposed 'roachstress' script linked from the repro steps at
https://gist.github.com/tbg/fc4409e06b4ddc9abc58838cf5c0fe8b is
utterly broken: it confuses local with remote execution.

The correct steps must clarify that `roachprod` and `roachtest`
binaries must be built for the local platform, and the `cockroach` and
`workload` binaries must be built for the remote platform. Someone
trying to run this on macOS, or on Linux without the old glibc,
gets inscrutable error messages with the current script.

Additionally, the proposed roachstress script uses a `--cpu-quota`
parameter that's incoherent with `roachtest`'s own default.

Finally, I (@knz) directly witnessed a new hire who was utterly
confused by the amount of automation and assumptions made in the
roachtest script. (Should they install caffeinate? Why? How to
customize the roachtest flags? How to change the number of runs? Can
the test run locally on my computer?)

In summary, the `roachstress` script is just too magical and actively
hostile to newcomers.

This patch removes it and replaces it with clearer instructions.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz
Contributor Author

knz commented Nov 1, 2021

Heads up: the reason this issue is still open is that we have not yet migrated all the payloads in the diagnostic report to a structured logging solution.

@knz knz changed the title rfc(telemetry): Explore a logging-based telemetry solution telemetry: move the diagnostic protos to structured logging + submit payloads to reg server for backward-compat Nov 1, 2021
@thtruo
Contributor

thtruo commented Nov 1, 2021

cc @dhartunian @nkodali as this was brought up in the logging knowledge transfer session

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
