Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql,log: new structured event FirstQuery #67809

Closed
wants to merge 2 commits into from

Conversation

knz
Copy link
Contributor

@knz knz commented Jul 20, 2021

(This PR is forked off #66427)

Supersedes #63812
Fixes #66271
Informs #63815

This change aims to expose “time to first query”, a product metric
that correlates with how easy it is for a user to start using
CockroachDB.

The naive approach to do this would be to store a boolean "has the
first query been reported in telemetry yet" inside a system table.
This naive approach is defective because it incurs a storage
read-eval-write cycle with an access to KV storage, or disk, or a
fan-out across the network, upon every query. This is an unacceptable
performance overhead.

Instead, we want the collection of this metric to have virtually no
performance impact on the common case of workloads beyond the first
query.

To get to a better design, we need to better understand the
requirement. A naive interpretation of the requirement would be to
assume that we need the time to first query to be reported exactly
once
.

Besides the fact that this is strictly impossible from a
computer-theoretical perspective [1], an exactly-once approach would
also incur unacceptable performance overhead, due to the need to
synchronize between multiple nodes, when the first query is sent
through a load balancer.

So what can we do instead?

In an initial implementation (#63812), we tried to approximate
“exactly once” reporting with a design that combines “at least once”
with a best effort of “at most once”. This was achieved using a
cluster setting. That solution was storing the time of the first query
inside a cluster setting, and relying on the auto-propagation of
setting changes through the cluster to announce to other nodes that
the first query had been reported already.

The use of a cluster setting also takes care of persisting this
information across node restarts, again towards the need to
approximate “at most once”.

After further discussion with stakeholders, it appears that there
is no strong requirement to reach “at most once”. The data processing
pipeline that consumes the "first query" event is only interested to
trigger the first time the event is recognized for a given cluster and
can be easily configured to ignore events reported after that.

The only practical consideration is to minimize the amount of
telemetry data (to ensure scalability over the CockroachCloud fleet),
and so we are not interested to send a "first query" event for every
single query executed
either (obviously).

Based on this new understanding, this commit simplifies the approach
taken in #63812: it makes every CockroachDB process report the first
SQL query it executes on the TELEMETRY channel. Whether the process
has reported the event already or not is stored in a memory-only
atomic variable.

This approach technically has "at most once" semantics per node (an
event can get lost if the logging output is stalled at the moment the
event is reported), but approximates "at least once" semantics very
well for multi-node clusters, or when nodes restart over time—as,
presumably, the logging system will not be clogged all the time.

There are two open questions in this approach however:

  • when a CockroachCloud cluster is initially configured by the CC
    intrusion control plane), there are “administrative” queries that
    are not internal but also not user initiated.

    With the current patch, the "first query" event will be reported
    for these administrative SQL queries. Is this desired?
    (NB: the same question applied to [WIP] server,sql: make the time-to-first-query observable in telemetry #63812)

    If not, we will need to agree on a reliable way to identify
    a user-initiated query.

    (A similar question exists when the user opens cockroach sql, but
    does not type anything in. The cockroach sql command sends queries
    in the background, and would trigger the "first query" event even
    without user interaction. Are we OK with this?)

  • this approach will cause a "first query" event to be logged at
    least once per SQL pod
    (when multiple pods are running side-by-side),
    and also once after every pod restart.

    To retrieve the “time to first query” (i.e. not “time of
    first query”), the consumer of these events must then compute the
    minimum timestamp of all the first query events received, and
    substract the timestamp of cluster creation. (The cluster creation
    timestamp can be obtained either via an external source, e.g. when
    the customer initiated the orchestration to create the cluster, or
    by retrieving the first node_restart event from the OPS logging
    channel.)

    Is this behavior acceptable?

Of note: departure from "at most once" delivery has another advantage
with regards to product telemetry: it teaches us something about the
effective utilization of all the nodes in a cluster. If we know the
cluster has 10 nodes but only 3 report "first query" events, that
might teach us that the cluster is over-provisioned.

[1] in a distributed system, an event can have guaranteed
at-least-once delivery, or at-most-once, but never both. We also have
the same constraint for CDC.

Release note (cli change): CockroachDB now reports a first_query
structured event on the TELEMETRY channel for the first SQL query
executed on each server process.

knz added 2 commits July 20, 2021 18:44
See discussion in cockroachdb#63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. By default, this channel is connected to file output, with
a maximum retention of 1MiB. To redirect the diagnostic output, one
can define a new sink that captures this channel.

For example, to see diagnostics reports on the standard
error, one can use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

(At the time of this writing, no events are defined for the TELEMETRY
channel yet.)
This change aims to expose “time to first query”, a product metric
that correlates with how easy it is for a user to start using
CockroachDB.

The naive approach to do this would be to store a boolean "has the
first query been reported in telemetry yet" inside a system table.
This naive approach is defective because it incurs a storage
read-eval-write cycle with an access to KV storage, or disk, or a
fan-out across the network, upon every query. This is an unacceptable
performance overhead.

Instead, we want the collection of this metric to have virtually no
performance impact on the common case of workloads beyond the first
query.

To get to a better design, we need to better understand the
requirement.  A naive interpretation of the requirement would be to
assume that we need the time to first query to be reported *exactly
once*.

Besides the fact that this is strictly impossible from a
computer-theoretical perspective [1], an exactly-once approach would
also incur unacceptable performance overhead, due to the need to
synchronize between multiple nodes, when the first query is sent
through a load balancer.

So what can we do instead?

In an initial implementation (cockroachdb#63812), we tried to *approximate*
“exactly once” reporting with a design that combines “at least once”
with a best effort of “at most once”. This was achieved using a
cluster setting. That solution was storing the time of the first query
inside a cluster setting, and relying on the auto-propagation of
setting changes through the cluster to announce to other nodes that
the first query had been reported already.

The use of a cluster setting also takes care of persisting this
information across node restarts, again towards the need to
approximate “at most once”.

After further discussion with stakeholders, it appears that there
is no strong requirement to reach “at most once”. The data processing
pipeline that consumes the "first query" event is only interested to
trigger the first time the event is recognized for a given cluster and
can be easily configured to ignore events reported after that.

The only practical consideration is to minimize the amount of
telemetry data (to ensure scalability over the CockroachCloud fleet),
and so we are not interested to send a "first query" event for *every
single query executed* either (obviously).

Based on this new understanding, this commit simplifies the approach
taken in cockroachdb#63812: it makes every CockroachDB *process* report the first
SQL query it executes on the TELEMETRY channel. Whether the process
has reported the event already or not is stored in a memory-only
atomic variable.

This approach technically has "at most once" semantics per node (an
event can get lost if the logging output is stalled at the moment the
event is reported), but approximates "at least once" semantics very
well for multi-node clusters, or when nodes restart over time—as,
presumably, the logging system will not be clogged all the time.

There are two open questions in this approach however:

- when a CockroachCloud cluster is initially configured by the CC
  intrusion control plane), there are “administrative” queries that
  are *not internal* but also *not user initiated*.

  With the current patch, the "first query" event will be reported
  for these administrative SQL queries. Is this desired?
  (NB: the same question applied to cockroachdb#63812)

  If not, we will need to agree on a reliable way to identify
  a user-initiated query.

  (A similar question exists when the user opens `cockroach sql`, but
  does not type anything in. The `cockroach sql` command sends queries
  in the background, and would trigger the "first query" event even
  without user interaction. Are we OK with this?)

- this approach will cause a "first query" event to be logged *at
  least once per SQL pod* (when multiple pods are running side-by-side),
  and also *once after every pod restart*.

  To retrieve the “time **to** first query” (i.e. not “time **of**
  first query”), the consumer of these events must then compute the
  minimum timestamp of all the first query events received, and
  substract the timestamp of cluster creation.  (The cluster creation
  timestamp can be obtained either via an external source, e.g. when
  the customer initiated the orchestration to create the cluster, or
  by retrieving the first ``node_restart`` event from the OPS logging
  channel.)

  Is this behavior acceptable?

Of note: departure from "at most once" delivery has another advantage
with regards to product telemetry: it teaches us something about the
effective utilization of all the nodes in a cluster. If we know the
cluster has 10 nodes but only 3 report "first query" events, that
might teach us that the cluster is over-provisioned.

[1] in a distributed system, an event can have guaranteed
at-least-once delivery, or at-most-once, but never both. We also have
the same constraint for CDC.

Release note (cli change): CockroachDB now reports a `first_query`
structured event on the TELEMETRY channel for the first SQL query
executed on each server process.
@knz knz requested review from a team as code owners July 20, 2021 16:48
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Jul 20, 2021

See review comment from @thtruo here:
#66427 (comment)

Other review comment:
#66427 (comment)

Internal slack thread:
https://cockroachlabs.slack.com/archives/C0159JK877C/p1623696396053600

@knz
Copy link
Contributor Author

knz commented Jul 20, 2021

The "next action" here is to find another design than the one advertised in this PR,
based on the advice provided by @Azhng and @rafiss in the internal slack thread (linked above).

I could perhaps consider restarting this PR from scratch using said advice, but perhaps the better way do to this would be to transfer ownership of the problem to the SQL obs team altogether.

@maryliag what do you think?

craig bot pushed a commit that referenced this pull request Jul 22, 2021
66427: util/log,cli: new TELEMETRY channel r=maryliag a=knz

Supersedes #64218


Informs #63815.

See discussion in #63815: we would like to leverage the logging
subsystem, in particular network logging, as a new way
to channel diagnostic reports.

This work will operate in multiple phases. This commit is only
about the first phase: creating a new logging channel TELEMETRY.

(see PRs #67809 and #67507 for example follow-ups)

Release note (cli change): CockroachDB now supports a new logging
channel called TELEMETRY. This will be used in later versions to
report diagnostic events useful to Cockroach Labs for product
analytics. (At the time of this writing, no events are defined for the
TELEMETRY channel yet.)

When no logging configuration is specified, this channel is connected
to file output, with a maximum retention of 1MiB.

To also produce the diagnostic output elsewhere, one can define a new
sink that captures this channel.
For example, to see diagnostics reports on the standard error, one can
use: `--log='sinks: {stderr: {channels: TELEMETRY, filter: INFO}}'`

When configuring file output, the operator should be careful to apply a
separate maximum retention for the TELEMETRY channel from other file
outputs, as telemetry data can be verbose and outcrowd other logging
messages. For example:

`--log='sinks: {file-groups: {telemetry: {channels: TELEMETRY,
max-group-size: 1MB}, ...}}`


67913: roachtest: remove mention of 'roachstress' r=tbg a=knz

The proposed 'roachstress' script linked from the repro steps at
https://gist.github.com/tbg/fc4409e06b4ddc9abc58838cf5c0fe8b is
utterly broken: it confuses local with remote execution.

The correct steps must clarify that `roachprod` and `roachtest`
binaries must be built for the local platform, and the `cockroach` and
`workload` binaries must be built for the remote platform. Someone
trying to run this on macos, or with a linux without the old glibc,
gets inscrutable error messages with the current script.

Additionally, the proposed roachstress script uses a `--cpu-quota`
parameter that's incoherent with `roachtest`'s own default.

Finally, I (@knz) directly witnessed a new hire who was utterly
confused by the amount of automation and assumptions made in the
roachtest script. (Should they install caffeinate? Why? How to
customize the roachtest flags? How to change the number of runs? Can
the test run locally  on my computer?)

In summary, the `roachstress` script is just too magical and actively
hostile to newcomers.

This patch removes it and replaces it by clearer instructiosn.

Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
@knz
Copy link
Contributor Author

knz commented Jan 10, 2023

@maryliag @THardy98 I'm going to close this PR; however, I would encourage you to pick up any useful bits from it, especially the motivation in the commit message / PR description, because I think it is applicable to future work.

@knz knz closed this Jan 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: send a distinct telemetry event for the first query
2 participants