telemetry: move the diagnostic protos to structured logging + submit payloads to reg server for backward-compat #63815
Once we have a TELEMETRY log channel configured as described above, we'd have an alternate approach to "time to first query", simpler than the one proposed in #63812.
A drawback of this approach is that a failure in the network log collector could cause the "cluster init" event or some of the "first query" events to be lost.
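For concreteness, here is a minimal sketch of how "time to first query" could be derived offline from a TELEMETRY log written to a JSON file sink. The event names (`cluster_init`, `first_query`) and the JSON fields are assumptions for illustration, not the actual event schema:

```go
// Sketch only: scan a hypothetical JSON-formatted TELEMETRY log and compute
// the gap between the cluster-init event and the first query event.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type telemetryEntry struct {
	EventType string  `json:"event_type"` // assumed field name
	Timestamp float64 `json:"timestamp"`  // assumed: seconds since epoch
}

func main() {
	f, err := os.Open("cockroach-telemetry.log") // placeholder file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var initTS, firstQueryTS float64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var e telemetryEntry
		if json.Unmarshal(sc.Bytes(), &e) != nil {
			continue // skip non-JSON lines (file headers, etc.)
		}
		switch e.EventType {
		case "cluster_init":
			initTS = e.Timestamp
		case "first_query":
			if firstQueryTS == 0 {
				firstQueryTS = e.Timestamp
			}
		}
	}
	if initTS > 0 && firstQueryTS > 0 {
		fmt.Printf("time to first query: %.1fs\n", firstQueryTS-initTS)
	}
}
```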
Looks great @knz! I have a couple of notes/questions:
Like you mention, for CC this will be a small lift for 21.1+ clusters once we can direct TELEMETRY events via a logging channel configuration. 20.2 CC clusters will require a lot of care, but perhaps we can delay porting them until they've migrated to 21.1, obviating the need for that extra work.
I see this being the case for some time to come. For on-prem, it will be tricky to ship logs to our global sink without a proxy, given the need for authn/authz when POSTing to the sink. We could provide on-prem customers with a suggested solution/config for relaying telemetry logs such that isolated and non-isolated customers have a very similar setup. Isolated customers would simply have one less sink in their fluentbit/fluentd/logstash config.
To "close the loop", we could have this endpoint proxy the events to our global sink such that all logs (both CC and on-prem) end up in the same location. Perhaps this was what you were alluding to already and I misunderstood.
Nit: Instead, can we dump these logs to a per cluster bucket as well as the global sink? This would require minimal changes in CC. |
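To make the relay idea above concrete, here is a minimal sketch of such a proxy, assuming the cluster's HTTP log sink POSTs payloads to a `/log` route and the relay forwards them with credentials attached. The sink URL, route, and token handling are placeholders, not an actual CRL endpoint:

```go
// Sketch of a telemetry log relay: accept POSTed log payloads from clusters
// and forward them to the global sink with credentials the clusters lack.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"os"
)

// Placeholder: the real sink URL and auth scheme would be provided by CRL.
const globalSinkURL = "https://telemetry.example.invalid/ingest"

func relay(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	// Forward the payload verbatim, adding authn that the cluster does not have.
	req, err := http.NewRequest(http.MethodPost, globalSinkURL, bytes.NewReader(body))
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	req.Header.Set("Content-Type", r.Header.Get("Content-Type"))
	req.Header.Set("Authorization", "Bearer "+os.Getenv("TELEMETRY_SINK_TOKEN"))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, "upstream unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
}

func main() {
	http.HandleFunc("/log", relay)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```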
I see now that there is a more in-depth doc that precedes this issue. Feel free to ignore my comments here in favor of my comments in the other document.
I think the other doc is much narrower in scope. The "log-based telemetry project", if we start it, will have more stakeholders and a wider impact.
Yes, we can do this. The issue at top is a general vision, and the decision of which events get routed where is a matter of configuration -- i.e. a decision that the consumers of the feature (the SREs and the data team) make together, separately from this work.
See discussion in cockroachdb#63815: we would like to leverage the logging subsystem, in particular network logging, as a new way to channel diagnostic reports. This work will operate in multiple phases. This commit is only about the first phase:

- creating a new logging channel TELEMETRY;
- ensuring the channel is not redirected to files by default;
- copying the diagnostic report to the TELEMETRY channel, *in addition* to sending it to `register.cockroachdb.com` as previously.

Later work will aim to further this project by subsuming the HTTP reporting to `register.cockroachdb.com` entirely under logging, e.g. via an HTTP sink (see issue cockroachdb#63980).

Release note (cli change): CockroachDB now also sends telemetry diagnostics to the new logging channel TELEMETRY. By default, this channel is not connected to file output, so there is no observable difference in behavior for default deployments. To redirect the diagnostic output, one can define a new sink that captures this channel. For example, to see diagnostics reports on the standard error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`
A downside to changing how telemetry works: we'll stop getting telemetry data for older CRDB versions (we still get lots of reports from versions that are years old), since older versions will only know how to log to the registry service, which presumably will be shut down. Or, if it's not shut down, we'll be maintaining two parallel pipelines, which means even more work and complication. Also, the biggest problems I've seen with the reporting pipeline are:
Is it really the telemetry technology that is problematic (i.e. RPC call vs. logging-based)? We could switch to logging-based telemetry, and I think we'd still be in the same bad place, because that solution doesn't address either problem #1 or problem #2 above.
The way it's being envisioned is this:
This way, as a technology end result:
The project discussed here is somewhat orthogonal to these two problems. FYI:
I think it's going to be useful to call out "problem 2" during the architectural discussion next week.
cc @JuanLeon1
See discussion in cockroachdb#63815: we would like to leverage the logging subsystem, in particular network logging, as a new way to channel diagnostic reports. This work will operate in multiple phases. This commit is only about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging channel called TELEMETRY. This will be used in later versions to report diagnostic events useful to Cockroach Labs for product analytics. By default, this channel is connected to file output, with a maximum retention of 1MiB. To redirect the diagnostic output, one can define a new sink that captures this channel. For example, to see diagnostics reports on the standard error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'` (At the time of this writing, no events are defined for the TELEMETRY channel yet.)
See discussion in cockroachdb#63815: we would like to leverage the logging subsystem, in particular network logging, as a new way to channel diagnostic reports. This work will operate in multiple phases. This commit is only about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging channel called TELEMETRY. This will be used in later versions to report diagnostic events useful to Cockroach Labs for product analytics. (At the time of this writing, no events are defined for the TELEMETRY channel yet.) When no logging configuration is specified, this channel is connected to file output, with a maximum retention of 1MiB. To also produce the diagnostic output elsewhere, one can define a new sink that captures this channel. For example, to see diagnostics reports on the standard error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'` When configuring file output, the operator should be careful to apply a separate maximum retention for the TELEMETRY channel, distinct from other file outputs, as telemetry data can be verbose and crowd out other logging messages. For example: `--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY, max-group-size: 1MB}, ...}}'`
66427: util/log,cli: new TELEMETRY channel r=maryliag a=knz

Supersedes #64218. Informs #63815.

See discussion in #63815: we would like to leverage the logging subsystem, in particular network logging, as a new way to channel diagnostic reports. This work will operate in multiple phases. This commit is only about the first phase: creating a new logging channel TELEMETRY. (See PRs #67809 and #67507 for example follow-ups.)

Release note (cli change): CockroachDB now supports a new logging channel called TELEMETRY. This will be used in later versions to report diagnostic events useful to Cockroach Labs for product analytics. (At the time of this writing, no events are defined for the TELEMETRY channel yet.) When no logging configuration is specified, this channel is connected to file output, with a maximum retention of 1MiB. To also produce the diagnostic output elsewhere, one can define a new sink that captures this channel. For example, to see diagnostics reports on the standard error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'` When configuring file output, the operator should be careful to apply a separate maximum retention for the TELEMETRY channel, distinct from other file outputs, as telemetry data can be verbose and crowd out other logging messages. For example: `--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY, max-group-size: 1MB}, ...}}'`

67913: roachtest: remove mention of 'roachstress' r=tbg a=knz

The proposed 'roachstress' script linked from the repro steps at https://gist.github.com/tbg/fc4409e06b4ddc9abc58838cf5c0fe8b is utterly broken: it confuses local with remote execution. The correct steps must clarify that the `roachprod` and `roachtest` binaries must be built for the local platform, and the `cockroach` and `workload` binaries must be built for the remote platform. Someone trying to run this on macOS, or on Linux without the old glibc, gets inscrutable error messages with the current script.

Additionally, the proposed roachstress script uses a `--cpu-quota` parameter that's incoherent with `roachtest`'s own default.

Finally, I (@knz) directly witnessed a new hire who was utterly confused by the amount of automation and assumptions made in the roachstress script. (Should they install caffeinate? Why? How to customize the roachtest flags? How to change the number of runs? Can the test run locally on my computer?)

In summary, the `roachstress` script is just too magical and actively hostile to newcomers. This patch removes it and replaces it with clearer instructions.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Heads up: the reason this issue is still open is that we have not yet migrated all the payloads in the diagnostic report to a structured logging solution.
cc @dhartunian @nkodali as this was brought up in the logging knowledge transfer session
We have marked this issue as stale because it has been inactive for an extended period.
This was discussed with @vilamurugu and @petermattis.
Instead of pushing state changes to a central registration server, we could instead emit telemetry events in an append-only store (e.g. on S3) and mine the telemetry data using event aggregation.
This approach could be called “telemetry logging” to contrast it with the (current) “telemetry updates”.
How would this work?
@petermattis suggests leveraging the new infrastructure for structured logging. We could operate as follows:

- define a new event type in `eventpb` that embeds telemetry data (this could possibly be as simple as a new proto message type that embeds a field of the type defined in the `diagnosticspb` package);
- emit the telemetry payloads using `log.StructuredEvent` instead of a custom HTTP PUT/GET client and request (see the sketch below).

This plan depends on the existence of a log collector for CockroachCloud; however, we have already implemented such a thing for other security purposes. We can thus extend its usage.
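For concreteness, here is a self-contained sketch of the proposed emission path. The real implementation would use the actual `eventpb` / `diagnosticspb` protos and `log.StructuredEvent`; the type names, fields, and the stderr stand-in sink below are illustrative assumptions only:

```go
// Schematic only: this mirrors the shape of the proposal without depending
// on the CockroachDB tree. Nothing here is the real API.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// DiagnosticsPayload stands in for the existing diagnosticspb report type.
type DiagnosticsPayload struct {
	NodeID      int32            `json:"node_id"`
	BuildTag    string           `json:"build_tag"`
	FeatureUses map[string]int64 `json:"feature_uses"`
}

// DiagnosticsEvent stands in for the proposed new eventpb message: a thin
// wrapper that embeds the diagnostics payload and is routed to TELEMETRY.
type DiagnosticsEvent struct {
	Timestamp   time.Time          `json:"timestamp"`
	Diagnostics DiagnosticsPayload `json:"diagnostics"`
}

// structuredEvent stands in for log.StructuredEvent: serialize the event and
// hand it to the logging subsystem, which fans it out to whichever sinks are
// configured for the TELEMETRY channel. Here stderr plays the role of a sink.
func structuredEvent(ev DiagnosticsEvent) {
	b, err := json.Marshal(ev)
	if err != nil {
		panic(err)
	}
	fmt.Fprintf(os.Stderr, "TELEMETRY %s\n", b)
}

func main() {
	structuredEvent(DiagnosticsEvent{
		Timestamp: time.Now(),
		Diagnostics: DiagnosticsPayload{
			NodeID:      1,
			BuildTag:    "v21.2.0",
			FeatureUses: map[string]int64{"sql.schema.create_table": 3},
		},
	})
}
```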
Once this plan is implemented, we would still receive the telemetry for on-prem users, albeit using HTTP requests issued by the logging system (instead of the ad-hoc HTTP client in our current diagnostics package).
Additionally, we would receive the telemetry events from CC clusters through the log collector and these can be redirected to a CRL-internal S3 bucket by a configuration at the level of the log collector.
Another benefit is that it would allow on-prem users with network restrictions to collect telemetry events using their own log collector, and deliver them to us manually upon request. (Currently telemetry from isolated on-prem clusters is simply lost.)
Epic: CC-4303
Jira issue: CRDB-6782