
telemetry: move the diagnostic protos to structured logging + submit payloads to reg server for backward-compat #63815

Open
knz opened this issue Apr 17, 2021 · 10 comments
Labels
A-logging In and around the logging infrastructure. A-observability-inf A-telemetry C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@knz
Contributor

knz commented Apr 17, 2021

This was discussed with @vilamurugu and @petermattis

Instead of pushing state changes to a central registration server, we could emit telemetry events to an append-only store (e.g. on S3) and mine the telemetry data using event aggregation.

This approach could be called “telemetry logging” to contrast it with the (current) “telemetry updates”.

How would this work?

@petermattis suggests leveraging the new infrastructure for structured logging. We could operate as follows:

  • define a new logging channel TELEMETRY.
  • create a new structured log event type in eventpb that embeds telemetry data (this could possibly be as simple as a new proto message type that embeds a field of the type defined in the diagnosticspb package)
  • change the telemetry update loop to use log.StructuredEvent instead of a custom HTTP PUT/GET client and request (see the sketch after this list).
  • for CockroachCloud customers, set up a logging config on each server that redirects TELEMETRY events to our shared log collector.
  • for on-prem customers and backward compatibility, implement a new logging sink for a HTTP collector (log: implement a HTTP sink #63980), and ensure that the default config redirects TELEMETRY events to our existing registration HTTP endpoint via the network logging subsystem
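
To make the middle steps concrete, here is a minimal, self-contained Go sketch of what the emission side could look like. The `TelemetryReport` type, its fields, and the `structuredEvent` helper are hypothetical stand-ins for the real eventpb/diagnosticspb definitions and for `log.StructuredEvent`; the actual types and signatures live in the crdb tree and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// TelemetryReport is a stand-in for a hypothetical eventpb event type
// that would embed the diagnosticspb payload.
type TelemetryReport struct {
	Timestamp int64           `json:"timestamp"`
	EventType string          `json:"event_type"`
	Report    json.RawMessage `json:"report"` // the serialized diagnostic report
}

// structuredEvent is a stand-in for log.StructuredEvent: it renders the
// event as JSON and hands it to whichever sinks the TELEMETRY channel
// is configured with (here, just stdout).
func structuredEvent(ev TelemetryReport) {
	b, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}

func main() {
	// Where the diagnostics update loop currently performs an HTTP request,
	// it would instead emit a structured event on the TELEMETRY channel.
	report := json.RawMessage(`{"node_id": 1, "uses_enterprise_features": false}`)
	structuredEvent(TelemetryReport{
		Timestamp: time.Now().UnixNano(),
		EventType: "diagnostic_report",
		Report:    report,
	})
}
```

The point is only that the diagnostics update loop hands a proto-backed payload to the logging subsystem and lets the configured TELEMETRY sinks decide where it goes.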

This plan depends on the existence of a log collector for CockroachCloud; however, we have already implemented such a collector for other security purposes, so we can extend its usage.

Once this plan is implemented, we would still receive the telemetry for on-prem users, albeit using HTTP requests issued by the logging system (instead of the ad-hoc HTTP client in our current diagnostics package).

Additionally, we would receive the telemetry events from CC clusters through the log collector, and these can be redirected to a CRL-internal S3 bucket via configuration at the log collector level.

Another benefit is that it would allow on-prem users with network restrictions to collect telemetry events using their own log collector, and deliver them to us manually upon request. (Currently telemetry from isolated on-prem clusters is simply lost.)

Epic: CC-4303

Jira issue: CRDB-6782

@knz knz added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-telemetry A-logging In and around the logging infrastructure. labels Apr 17, 2021
@knz
Contributor Author

knz commented Apr 17, 2021

Once we have a TELEMETRY log channel configured as described above, we'd have an alternate approach to "time to first query", simpler than that proposed in #63812:

  • the init node would log a "cluster initialized" event on the TELEMETRY channel upon cluster initialization.
  • every node would log a "first query executed" event on the TELEMETRY channel the first time it runs a query after starting up.
  • upon log aggregation, we'd simply scan the log event stream and compute the timestamp difference between the "cluster init" event and the first "first query exec" event (a sketch of this aggregation follows below).
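
To illustrate the aggregation step, here is a small self-contained Go sketch that computes the delta from a decoded event stream. The event names mirror the bullets above but are otherwise made up; the real job would run over the collected structured logs rather than an in-memory slice.

```go
package main

import (
	"fmt"
	"time"
)

// event is a stand-in for a decoded TELEMETRY log entry after collection.
type event struct {
	Type string    // "cluster_initialized" or "first_query_executed"
	At   time.Time // log timestamp
}

// timeToFirstQuery scans the aggregated event stream and returns the
// difference between the cluster-init event and the earliest
// first-query event across all nodes.
func timeToFirstQuery(events []event) (time.Duration, bool) {
	var init, firstQuery time.Time
	for _, ev := range events {
		switch ev.Type {
		case "cluster_initialized":
			init = ev.At
		case "first_query_executed":
			if firstQuery.IsZero() || ev.At.Before(firstQuery) {
				firstQuery = ev.At
			}
		}
	}
	if init.IsZero() || firstQuery.IsZero() {
		return 0, false // one of the events was lost or not yet emitted
	}
	return firstQuery.Sub(init), true
}

func main() {
	t0 := time.Date(2021, 4, 17, 12, 0, 0, 0, time.UTC)
	events := []event{
		{Type: "cluster_initialized", At: t0},
		{Type: "first_query_executed", At: t0.Add(95 * time.Second)},
		{Type: "first_query_executed", At: t0.Add(3 * time.Minute)},
	}
	if d, ok := timeToFirstQuery(events); ok {
		fmt.Println("time to first query:", d) // 1m35s
	}
}
```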

A drawback of this approach is that a failure in the network log collector could cause the "cluster init" event or some of the "first query" events to be lost.

@logston
Contributor

logston commented Apr 19, 2021

Looks great @knz!

I have a couple of notes/questions:

we could emit telemetry events to an append-only store (e.g. on S3) and mine the telemetry data using event aggregation. / for CockroachCloud customers, set up a logging config on each server that redirects TELEMETRY events to our shared log collector.

Like you mention, for CC this will be a small lift for 21.1+ clusters once we can direct TELEMETRY events via a logging channel configuration. 20.2 CC clusters will require a lot of care, but perhaps we can delay porting them until they've migrated to 21.1, obviating the need for that extra work.

for on-prem customers and backward compatibility, implement a new logging sink for a HTTP collector, and ensure that the default config redirects TELEMETRY events to our existing registration HTTP endpoint via the network logging subsystem

I see this being the case for some time to come. For on-prem, it will be tricky to ship logs to our global sink without a proxy, given the need for authn/authz when POSTing to the sink. We could provide on-prem customers with a suggested solution/config for relaying telemetry logs such that the isolated vs. non-isolated customers have a very similar setup. Isolated customers would simply have one less sink in their fluentbit/fluentd/logstash config.

and ensure that the default config redirects TELEMETRY events to our existing registration HTTP endpoint via the network logging subsystem

To "close the loop", we could have this endpoint proxy the events to our global sink such that all logs (both CC and on-prem) end up in the same location. Perhaps this was what you were alluding to already and I misunderstood.

Additionally, we would receive the telemetry events from CC clusters through the log collector, and these can be redirected to a CRL-internal S3 bucket via configuration at the log collector level.

Nit: Instead, can we dump these logs to a per-cluster bucket as well as the global sink? This would require minimal changes in CC.

@logston
Contributor

logston commented Apr 20, 2021

I see now that there is a more in depth doc that precedes this issue. Feel free to ignore my comments here in favor of my comments in the other document.

@knz
Contributor Author

knz commented Apr 20, 2021

I think the other doc is much more narrow in scope. The "log-based telemetry project", if we start it, will have more stakeholders and a wider impact.

Nit: Instead, can we dump these logs to a per-cluster bucket as well as the global sink? This would require minimal changes in CC.

Yes, we can do this. The issue at top is a general vision, and the decision of which events get routed where is a matter of configuration -- i.e. a decision that the consumers of the feature (the SREs and the data team) make together, separately from this issue.

knz added a commit to knz/cockroach that referenced this issue May 3, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase:

- creating a new logging channel TELEMETRY
- ensuring the channel is not redirected to files by default.
- copying the diagnostic report to the TELEMETRY channel,
  *in addition* to sending it to `register.cockroachdb.com`
  as previously.

Later work will aim to further this project by subsuming
the HTTP reporting to `register.cockroachdb.com` entirely
under logging, e.g. via an HTTP sink (see issue cockroachdb#63980).

Release note (cli change): CockroachDB now also sends
telemetry diagnostics to the new logging channel TELEMETRY.
By default, this channel is not connected to file output,
so there is no observable difference in behavior
for default deployments. To redirect the diagnostic
output, one can define a new sink that captures
this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`
@andy-kimball
Contributor

A downside to changing how telemetry works: we'll stop getting telemetry data for older CRDB versions (we still get lots of reports from versions that are years old), since older versions will only know how to log to the registry service, which presumably will be shut down. Or, if it's not shut down, we'll be maintaining two parallel pipelines, which means even more work and complication.

Also, the biggest problems I've seen with the reporting pipeline are:

  1. There is too much data, making it take tens of minutes to run even simple queries.
  2. The data is low quality, because there are many test/dev clusters operating, and so the Product team doesn't trust it and doesn't use it much.

Is it really the telemetry technology that is problematic (i.e. RPC call vs. logging-based)? We could switch to logging-based telemetry, and I think we'd still be in the same bad place, because that solution doesn't address either problem #1 or problem #2 above.

@knz
Contributor Author

knz commented May 7, 2021

Or, if it's not shut down, we'll be maintaining two parallel pipelines, which means even more work and complication.

The way it's being envisioned is this:

  • for on-prem users, we will continue HTTP uploads. The implementation thereof inside crdb will move from the diagnostics package to the util/log package via an HTTP sink (log: implement a HTTP sink #63980)
  • however, this does not mean we are going to maintain 2 pipelines; instead, the HTTP collector will be changed to append the data to S3 (or the storage bucket) rather than storing it inside a crdb cluster. This will unify the aggregation across on-prem and CC clusters (a rough sketch of such a collector follows below).
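
Roughly, the collector-side change could look like the sketch below, with the local filesystem standing in for the S3/object-store client; the `/report` path and `uuid` parameter are illustrative, not the actual contract of the registration endpoint.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"
	"time"
)

// dataDir stands in for the S3 (or other storage) bucket; in the real
// collector this would be an object-store client, not the local filesystem.
const dataDir = "/tmp/telemetry-bucket"

// collect appends each incoming report as a new object keyed by cluster
// ID and arrival time, so on-prem (via register.cockroachdb.com) and CC
// (via the log collector) reports land in the same place.
func collect(w http.ResponseWriter, r *http.Request) {
	clusterID := r.URL.Query().Get("uuid")
	if clusterID == "" {
		clusterID = "unknown"
	}
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	key := fmt.Sprintf("%s/%d.json", clusterID, time.Now().UnixNano())
	path := filepath.Join(dataDir, key)
	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}
	if err := os.WriteFile(path, body, 0644); err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/report", collect)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```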

This way, as a technology end result:

  • inside crdb, the emission of telemetry is still implemented in 1 place
  • inside CRL, the data arrives in 1 place
  • we preserve compatibility with previous crdb versions by keeping register.cockroachdb.com available
  • the data for old versions can arrive in the S3 bucket so that PMs can see it

the biggest problems I've seen with the reporting pipeline are: [...]

The project discussed here is somewhat orthogonal to these two problems.

FYI:

  • the solution to your "problem 1" will be to cull old data from the collected telemetry.
  • the solution to "problem 2" is to filter the data upfront, either in register.cockroachdb.com or later in the pipeline, before it gets to the storage buckets.

I think it's going to be useful to call out "problem 2" during the architectural discussion next week.

@thtruo
Contributor

thtruo commented May 27, 2021

cc @JuanLeon1

knz added a commit to knz/cockroach that referenced this issue Jun 14, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. By default, this channel is connected to file output, with
a maximum retention of 1MiB. To redirect the diagnostic output, one
can define a new sink that captures this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

(At the time of this writing, no events are defined for the TELEMETRY
channel yet.)
knz added a commit to knz/cockroach that referenced this issue Jul 20, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. By default, this channel is connected to file output, with
a maximum retention of 1MiB. To redirect the diagnostic output, one
can define a new sink that captures this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

(At the time of this writing, no events are defined for the TELEMETRY
channel yet.)
knz added a commit to knz/cockroach that referenced this issue Jul 22, 2021
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
maximum retention for the TELEMETRY channel separate from that of other
file outputs, as telemetry data can be verbose and crowd out other
logging messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}'`
craig bot pushed a commit that referenced this issue Jul 22, 2021
66427: util/log,cli: new TELEMETRY channel r=maryliag a=knz

Supersedes #64218


Informs #63815.

See discussion in #63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

(see PRs #67809 and #67507 for example follow-ups)

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
maximum retention for the TELEMETRY channel separate from that of other
file outputs, as telemetry data can be verbose and crowd out other
logging messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}'`


67913: roachtest: remove mention of 'roachstress' r=tbg a=knz

The proposed 'roachstress' script linked from the repro steps at
https://gist.github.com/tbg/fc4409e06b4ddc9abc58838cf5c0fe8b is
utterly broken: it confuses local with remote execution.

The correct steps must clarify that `roachprod` and `roachtest`
binaries must be built for the local platform, and the `cockroach` and
`workload` binaries must be built for the remote platform. Someone
trying to run this on macOS, or on Linux without the old glibc,
gets inscrutable error messages with the current script.

Additionally, the proposed roachstress script uses a `--cpu-quota`
parameter that's incoherent with `roachtest`'s own default.

Finally, I (@knz) directly witnessed a new hire who was utterly
confused by the amount of automation and assumptions made in the
roachtest script. (Should they install caffeinate? Why? How to
customize the roachtest flags? How to change the number of runs? Can
the test run locally on my computer?)

In summary, the `roachstress` script is just too magical and actively
hostile to newcomers.

This patch removes it and replaces it with clearer instructions.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz
Contributor Author

knz commented Nov 1, 2021

Heads up: the reason this issue is still open is that we have not yet migrated all the payloads in the diagnostic report to a structured logging solution.

@knz knz changed the title rfc(telemetry): Explore a logging-based telemetry solution telemetry: move the diagnostic protos to structured logging + submit payloads to reg server for backward-compat Nov 1, 2021
@thtruo
Contributor

thtruo commented Nov 1, 2021

cc @dhartunian @nkodali as this was brought up in the logging knowledge transfer session

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
