sql,log: new structured event FirstQuery #67809

knz · 2021-07-20T16:48:55Z

(This PR is forked off #66427)

Supersedes #63812
Fixes #66271
Informs #63815

This change aims to expose “time to first query”, a product metric
that correlates with how easy it is for a user to start using
CockroachDB.

The naive approach to do this would be to store a boolean "has the
first query been reported in telemetry yet" inside a system table.
This naive approach is defective because it incurs a storage
read-eval-write cycle with an access to KV storage, or disk, or a
fan-out across the network, upon every query. This is an unacceptable
performance overhead.

Instead, we want the collection of this metric to have virtually no
performance impact on the common case of workloads beyond the first
query.

To get to a better design, we need to better understand the
requirement. A naive interpretation of the requirement would be to
assume that we need the time to first query to be reported exactly
once.

Besides the fact that this is strictly impossible from a
computer-theoretical perspective [1], an exactly-once approach would
also incur unacceptable performance overhead, due to the need to
synchronize between multiple nodes, when the first query is sent
through a load balancer.

So what can we do instead?

In an initial implementation (#63812), we tried to approximate
“exactly once” reporting with a design that combines “at least once”
with a best effort of “at most once”. This was achieved using a
cluster setting. That solution was storing the time of the first query
inside a cluster setting, and relying on the auto-propagation of
setting changes through the cluster to announce to other nodes that
the first query had been reported already.

The use of a cluster setting also takes care of persisting this
information across node restarts, again towards the need to
approximate “at most once”.

After further discussion with stakeholders, it appears that there
is no strong requirement to reach “at most once”. The data processing
pipeline that consumes the "first query" event is only interested to
trigger the first time the event is recognized for a given cluster and
can be easily configured to ignore events reported after that.

The only practical consideration is to minimize the amount of
telemetry data (to ensure scalability over the CockroachCloud fleet),
and so we are not interested to send a "first query" event for every
single query executed either (obviously).

Based on this new understanding, this commit simplifies the approach
taken in #63812: it makes every CockroachDB process report the first
SQL query it executes on the TELEMETRY channel. Whether the process
has reported the event already or not is stored in a memory-only
atomic variable.

This approach technically has "at most once" semantics per node (an
event can get lost if the logging output is stalled at the moment the
event is reported), but approximates "at least once" semantics very
well for multi-node clusters, or when nodes restart over time—as,
presumably, the logging system will not be clogged all the time.

There are two open questions in this approach however:

when a CockroachCloud cluster is initially configured by the CC
intrusion control plane), there are “administrative” queries that
are not internal but also not user initiated.

With the current patch, the "first query" event will be reported
for these administrative SQL queries. Is this desired?
(NB: the same question applied to [WIP] server,sql: make the time-to-first-query observable in telemetry #63812)

If not, we will need to agree on a reliable way to identify
a user-initiated query.

(A similar question exists when the user opens cockroach sql, but
does not type anything in. The cockroach sql command sends queries
in the background, and would trigger the "first query" event even
without user interaction. Are we OK with this?)
this approach will cause a "first query" event to be logged at
least once per SQL pod (when multiple pods are running side-by-side),
and also once after every pod restart.

To retrieve the “time to first query” (i.e. not “time of
first query”), the consumer of these events must then compute the
minimum timestamp of all the first query events received, and
substract the timestamp of cluster creation. (The cluster creation
timestamp can be obtained either via an external source, e.g. when
the customer initiated the orchestration to create the cluster, or
by retrieving the first node_restart event from the OPS logging
channel.)

Is this behavior acceptable?

Of note: departure from "at most once" delivery has another advantage
with regards to product telemetry: it teaches us something about the
effective utilization of all the nodes in a cluster. If we know the
cluster has 10 nodes but only 3 report "first query" events, that
might teach us that the cluster is over-provisioned.

[1] in a distributed system, an event can have guaranteed
at-least-once delivery, or at-most-once, but never both. We also have
the same constraint for CDC.

Release note (cli change): CockroachDB now reports a first_query
structured event on the TELEMETRY channel for the first SQL query
executed on each server process.

See discussion in cockroachdb#63815: we would like to leverage the logging subsystem, in particular network logging, as a new way to channel diagnostic reports. This work will operate in multiple phases. This commit is only about the first phase: creating a new logging channel TELEMETRY. Release note (cli change): CockroachDB now supports a new logging channel called TELEMETRY. This will be used in later versions to report diagnostic events useful to Cockroach Labs for product analytics. By default, this channel is connected to file output, with a maximum retention of 1MiB. To redirect the diagnostic output, one can define a new sink that captures this channel. For example, to see diagnostics reports on the standard error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'` (At the time of this writing, no events are defined for the TELEMETRY channel yet.)

This change aims to expose “time to first query”, a product metric that correlates with how easy it is for a user to start using CockroachDB. The naive approach to do this would be to store a boolean "has the first query been reported in telemetry yet" inside a system table. This naive approach is defective because it incurs a storage read-eval-write cycle with an access to KV storage, or disk, or a fan-out across the network, upon every query. This is an unacceptable performance overhead. Instead, we want the collection of this metric to have virtually no performance impact on the common case of workloads beyond the first query. To get to a better design, we need to better understand the requirement. A naive interpretation of the requirement would be to assume that we need the time to first query to be reported *exactly once*. Besides the fact that this is strictly impossible from a computer-theoretical perspective [1], an exactly-once approach would also incur unacceptable performance overhead, due to the need to synchronize between multiple nodes, when the first query is sent through a load balancer. So what can we do instead? In an initial implementation (cockroachdb#63812), we tried to *approximate* “exactly once” reporting with a design that combines “at least once” with a best effort of “at most once”. This was achieved using a cluster setting. That solution was storing the time of the first query inside a cluster setting, and relying on the auto-propagation of setting changes through the cluster to announce to other nodes that the first query had been reported already. The use of a cluster setting also takes care of persisting this information across node restarts, again towards the need to approximate “at most once”. After further discussion with stakeholders, it appears that there is no strong requirement to reach “at most once”. The data processing pipeline that consumes the "first query" event is only interested to trigger the first time the event is recognized for a given cluster and can be easily configured to ignore events reported after that. The only practical consideration is to minimize the amount of telemetry data (to ensure scalability over the CockroachCloud fleet), and so we are not interested to send a "first query" event for *every single query executed* either (obviously). Based on this new understanding, this commit simplifies the approach taken in cockroachdb#63812: it makes every CockroachDB *process* report the first SQL query it executes on the TELEMETRY channel. Whether the process has reported the event already or not is stored in a memory-only atomic variable. This approach technically has "at most once" semantics per node (an event can get lost if the logging output is stalled at the moment the event is reported), but approximates "at least once" semantics very well for multi-node clusters, or when nodes restart over time—as, presumably, the logging system will not be clogged all the time. There are two open questions in this approach however: - when a CockroachCloud cluster is initially configured by the CC intrusion control plane), there are “administrative” queries that are *not internal* but also *not user initiated*. With the current patch, the "first query" event will be reported for these administrative SQL queries. Is this desired? (NB: the same question applied to cockroachdb#63812) If not, we will need to agree on a reliable way to identify a user-initiated query. (A similar question exists when the user opens `cockroach sql`, but does not type anything in. The `cockroach sql` command sends queries in the background, and would trigger the "first query" event even without user interaction. Are we OK with this?) - this approach will cause a "first query" event to be logged *at least once per SQL pod* (when multiple pods are running side-by-side), and also *once after every pod restart*. To retrieve the “time **to** first query” (i.e. not “time **of** first query”), the consumer of these events must then compute the minimum timestamp of all the first query events received, and substract the timestamp of cluster creation. (The cluster creation timestamp can be obtained either via an external source, e.g. when the customer initiated the orchestration to create the cluster, or by retrieving the first ``node_restart`` event from the OPS logging channel.) Is this behavior acceptable? Of note: departure from "at most once" delivery has another advantage with regards to product telemetry: it teaches us something about the effective utilization of all the nodes in a cluster. If we know the cluster has 10 nodes but only 3 report "first query" events, that might teach us that the cluster is over-provisioned. [1] in a distributed system, an event can have guaranteed at-least-once delivery, or at-most-once, but never both. We also have the same constraint for CDC. Release note (cli change): CockroachDB now reports a `first_query` structured event on the TELEMETRY channel for the first SQL query executed on each server process.

cockroach-teamcity · 2021-07-20T16:49:04Z

This change is

knz · 2021-07-20T16:50:27Z

See review comment from @thtruo here:
#66427 (comment)

Other review comment:
#66427 (comment)

Internal slack thread:
https://cockroachlabs.slack.com/archives/C0159JK877C/p1623696396053600

knz · 2021-07-20T16:55:43Z

The "next action" here is to find another design than the one advertised in this PR,
based on the advice provided by @Azhng and @rafiss in the internal slack thread (linked above).

I could perhaps consider restarting this PR from scratch using said advice, but perhaps the better way do to this would be to transfer ownership of the problem to the SQL obs team altogether.

@maryliag what do you think?

@knz

66427: util/log,cli: new TELEMETRY channel r=maryliag a=knz Supersedes #64218 Informs #63815. See discussion in #63815: we would like to leverage the logging subsystem, in particular network logging, as a new way to channel diagnostic reports. This work will operate in multiple phases. This commit is only about the first phase: creating a new logging channel TELEMETRY. (see PRs #67809 and #67507 for example follow-ups) Release note (cli change): CockroachDB now supports a new logging channel called TELEMETRY. This will be used in later versions to report diagnostic events useful to Cockroach Labs for product analytics. (At the time of this writing, no events are defined for the TELEMETRY channel yet.) When no logging configuration is specified, this channel is connected to file output, with a maximum retention of 1MiB. To also produce the diagnostic output elsewhere, one can define a new sink that captures this channel. For example, to see diagnostics reports on the standard error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'` When configuring file output, the operator should be careful to apply a separate maximum retention for the TELEMETRY channel from other file outputs, as telemetry data can be verbose and outcrowd other logging messages. For example: `--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY, max-group-size: 1MB}, ...}}` 67913: roachtest: remove mention of 'roachstress' r=tbg a=knz The proposed 'roachstress' script linked from the repro steps at https://gist.github.com/tbg/fc4409e06b4ddc9abc58838cf5c0fe8b is utterly broken: it confuses local with remote execution. The correct steps must clarify that `roachprod` and `roachtest` binaries must be built for the local platform, and the `cockroach` and `workload` binaries must be built for the remote platform. Someone trying to run this on macos, or with a linux without the old glibc, gets inscrutable error messages with the current script. Additionally, the proposed roachstress script uses a `--cpu-quota` parameter that's incoherent with `roachtest`'s own default. Finally, I (@knz) directly witnessed a new hire who was utterly confused by the amount of automation and assumptions made in the roachtest script. (Should they install caffeinate? Why? How to customize the roachtest flags? How to change the number of runs? Can the test run locally on my computer?) In summary, the `roachstress` script is just too magical and actively hostile to newcomers. This patch removes it and replaces it by clearer instructiosn. Release note: None Co-authored-by: Raphael 'kena' Poss <[email protected]>

knz · 2023-01-10T12:25:16Z

@maryliag @THardy98 I'm going to close this PR; however, I would encourage you to pick up any useful bits from it, especially the motivation in the commit message / PR description, because I think it is applicable to future work.

knz added 2 commits July 20, 2021 18:44

knz requested review from a team as code owners July 20, 2021 16:48

knz mentioned this pull request Jul 20, 2021

util/log,cli: new TELEMETRY channel #66427

Merged

knz mentioned this pull request Jul 22, 2021

sql: add sql shell telemetry counter #67606

Merged

knz closed this Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql,log: new structured event FirstQuery #67809

sql,log: new structured event FirstQuery #67809

knz commented Jul 20, 2021 •

edited

Loading

cockroach-teamcity commented Jul 20, 2021

knz commented Jul 20, 2021

knz commented Jul 20, 2021

knz commented Jan 10, 2023

sql,log: new structured event FirstQuery #67809

sql,log: new structured event FirstQuery #67809

Conversation

knz commented Jul 20, 2021 • edited Loading

cockroach-teamcity commented Jul 20, 2021

knz commented Jul 20, 2021

knz commented Jul 20, 2021

knz commented Jan 10, 2023

knz commented Jul 20, 2021 •

edited

Loading