
[WIP] server,sql: make the time-to-first-query observable in telemetry #63812

Closed
wants to merge 1 commit

Conversation

knz
Contributor

@knz knz commented Apr 17, 2021

(inspired by the discussion on #55059 and a request by @vilamurugu)

This change aims to expose “time to first query”, a product
metric that correlates with how easy it is for a user
to start using CockroachDB.

Collecting “time to first query” is complicated
by the following:

  • we cannot compute this entirely on one node using solely in-memory
    operations: we need to find the first time a query is executed across
    a fleet of nodes, and we need to remember the value when nodes get
    restarted.

  • we would like to avoid a read-eval-write cycle with an access to KV
    storage, or disk, or a fan-out across the network, upon every query. The
    collection of this metric should have virtually no performance impact
    on the common case of workloads beyond the first access.

  • the metric “time to first query” is different from “time of first
    query”: we want to track the amount of time between cluster
    creation and the time of first query.

  • we wish to avoid taking into account “internal” SQL queries, such
    as those issued internally by CockroachDB itself and those
    issued by helper/orchestration tools.

In order to reach the stated goal, this commit proposes the following
model:

  • to avoid “internal” SQL queries, the mechanism described below
    only looks at queries issued when the application_name is
    not “internal”, that is, does not start with the special
    prefix "$ ".

  • nodes keep track of their “time of first query” as
    follows:

    • an in-memory only timestamp is maintained and checked upon
      every query execution. Accesses are atomic to avoid
      any performance impact.

    • separately, an asynchronous background process
      (startSynchronizeFirstQueryTimestamp) synchronizes
      the in-memory timestamp with a new cluster setting
      sql.metrics.first_query.timestamp.

The choice is made to use a cluster setting because these
automatically synchronize between nodes and are already
automatically included in telemetry.

  • separately and independently, the cluster now remembers its
    initialization time by storing a timestamp inside a new cluster
    setting sql.metrics.cluster_init.timestamp. This is set only once,
    upon cluster init.

Again, the choice is made to use a cluster setting because these
are already automatically included in telemetry.

Note that this change does not compute the time to first query
itself, but instead exposes two different timestamps in telemetry
(from the cluster settings) which can be subtracted from each other
to reach the stated goal.

Here is an example use, using cockroach demo (nb: by default demo
uses an internal application_name):

```
$ cockroach demo

root> show cluster setting sql.metrics.cluster_init.timestamp;
sql.metrics.cluster_init.timestamp
--------------------------------------
                 1618666751186151817

root> show cluster setting sql.metrics.first_query.timestamp;
  sql.metrics.first_query.timestamp
-------------------------------------
                                  0

root> set application_name = 'foo';

root> show cluster setting sql.metrics.first_query.timestamp;
  sql.metrics.first_query.timestamp
-------------------------------------
                1618666852579067840

root> select 1;

root> -- observe: the first query timestamp does not change!

root> show cluster setting sql.metrics.first_query.timestamp;
  sql.metrics.first_query.timestamp
-------------------------------------
                1618666852579067840
```

Note also that the new first_query.timestamp value is approximate.
This is explained by the following comment in the source code:

```
// The timestamp of the first query that was not issued by
// an "internal" client (as determined by application_name).
//
// This is set to non-zero by any node receiving a query if the
// value is observed to still be zero.
// Meanwhile, as for every cluster setting, it is possible for this
// value to take a while to propagate to the rest of this cluster.
// Therefore, it is possible for multiple nodes to concurrently
// write to the cluster setting, when they receive their first
// client query within a short time frame (the setting propagation
// delay).
// We find this acceptable because this propagation delay
// is short (seconds) whereas the information we hope to extract
// from this timestamp need only be at minutes/hours granularity.
```

See the release note below for a description of impact.

Release note (sql change): CockroachDB now stores the time at which
the SQL logical layer of a cluster was initialized in the cluster
setting sql.metrics.cluster_init.timestamp. It also stores the
approximate time at which the first query from a SQL client is
executed in the new cluster setting
sql.metrics.first_query.timestamp.

This makes it possible to compute the time to first query since
cluster initialization, by subtracting the two values.

Note that the value is approximate because when the initial clients
connect to multiple nodes simultaneously, it is possible for multiple
nodes to concurrently estimate that they received the “first
query”. The uncertainty in this approximation is limited by the
propagation delay of cluster settings, which is typically a dozen
seconds or less across the entire cluster.

@knz knz requested a review from jordanlewis April 17, 2021 13:38
@cockroach-teamcity
Member

This change is Reviewable

@knz knz force-pushed the 20210417-first-query-ts branch 2 times, most recently from a4182db to cfce208 on April 17, 2021 14:01
@knz knz requested a review from vilamurugu April 17, 2021 14:30
@knz
Contributor Author

knz commented Apr 17, 2021

@jordanlewis heads up that I am delaying the implementation of unit testing until we reach some consensus about whether this approach is acceptable.

@knz
Contributor Author

knz commented Apr 17, 2021

See this comment for an alternate approach: #63815 (comment)

craig bot pushed a commit that referenced this pull request Apr 20, 2021
63301: bootstrap: add cockroach/bin to Debian PATH r=rickystewart a=erikgrinaker

This makes it more convenient to use tools like roachprod on GCE
workers.

Release note: None

63656: doctor: add examine and recreate commands r=postamar a=postamar


    doctor: add verbose flag
    
    This commit adds a --verbose flag to the debug doctor CLI command
    which increases output verbosity.
    
    Release note (cli change): added --verbose flag to debug doctor command.


    doctor: fix string comparison bug
    
    Previously, doctor could panic when outputting validation failures with
    error messages under a certain length. This commit fixes this by using
    the proper string prefix comparison function, strings.HasPrefix.
    
    Release note: None


    sql: add checks in unsafe_upsert_descriptor
    
    Previously, it was quite easy to cause a node to crash by passing an
    invalid descriptor argument to the crdb_internal.unsafe_upsert_descriptor
    repair function. This commit makes this function more resilient to
    invalid inputs by changing the descpb.GetDescriptorMetadata function to
    return errors instead of panicking.
    
    Release note: None


    doctor: add examine and recreate commands
    
    The doctor tool can be very helpful for troubleshooting schema
    corruption issues. These investigations often involve loading the
    corrupted descriptor(s) into a local test environment to either better
    understand the root cause or to craft a valid descriptor for a repair.
    In any case, until now this required taking the descriptor hex encodings
    from the debug.zip and updating the system.descriptor table. This can be
    tedious and error-prone. Recent improvements to descriptor validation
    logic have also made it harder to deliberately corrupt descriptors,
    which has unfortunate side-effects in this use case.
    
    This commit changes the doctor CLI commands, adding the 'examine' and
    'recreate' commands. The former did what doctor used to do, the latter
    will output a series of SQL statements. When executed on an empty cluster
    these will populate it with all of the non-system descriptors and
    namespace table entries.
    
    These descriptors are almost as they were, except:
    - the version counter is reset to 1,
    - the modification time is unset,
    - the create-as-of time is unset (tables only).
    Maintaining the original values would violate too many assumptions to be
    safe.
    
    Fixes #62459.
    
    Release note (cli change): The doctor debug tool now has new 'examine'
    and 'recreate' commands. The new 'debug doctor examine' invocation
    replaces the old 'debug doctor' invocation. The new
    'debug doctor recreate' invocation prints SQL statements which will
    recreate the descriptor and namespace state when run in a new, empty
    cluster.


63748: server: remove the setting `server.remote_debugging.mode` r=bdarnell a=knz

Fixes  #57072.
Fixes #26283
Fixes #24992

As suggested/requested by @bdarnell.

Release note (security update): Certain HTTP debug endpoints reserved
to `admin` users now return more details about range start/end keys,
such as the "hot ranges" report.

Release note (security update): The cluster setting
`server.remote_debugging.mode` has been removed. The debug endpoints
are now available to every client with access to the HTTP port.
All the HTTP URLs previously affected by this setting already have
user authentication and require the logged-in user to be a member of
the `admin` group, so there was no need for an additional layer of
security.

63754: roachtest: add restore2TB config with large nodes r=pbardea a=pbardea

This test adds a new configuration to the restore2TB roachtest to test
the vertical scaling capabilities of restore on large nodes.

Release note: None

63814: backupccl: fix CheckEmittedEvents r=adityamaru a=knz

Found while working on #63812.

This ensures the test does not panic if there are more entries in the
log than expected.

Release note: None

63933: roachtest: skip psycopg password_encryption test r=RichardJCai a=rafiss

This is being skipped upstream (psycopg/psycopg2#1263)
but until that is released, we mark it as expected to fail.

fixes #63893

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@knz knz removed request for jordanlewis and vilamurugu June 14, 2021 14:00
knz added a commit to knz/cockroach that referenced this pull request Jun 14, 2021
This change aims to expose “time to first query”, a product metric
that correlates with how easy it is for a user to start using
CockroachDB.

The naive approach to do this would be to store a boolean "has the
first query been reported in telemetry yet" inside a system table.
This naive approach is defective because it incurs a storage
read-eval-write cycle with an access to KV storage, or disk, or a
fan-out across the network, upon every query. This is an unacceptable
performance overhead.

Instead, we want the collection of this metric to have virtually no
performance impact on the common case of workloads beyond the first
query.

To get to a better design, we need to better understand the
requirement.  A naive interpretation of the requirement would be to
assume that we need the time to first query to be reported *exactly
once*.

Besides the fact that this is strictly impossible from a
computer-theoretical perspective [1], an exactly-once approach would
also incur unacceptable performance overhead, due to the need to
synchronize between multiple nodes, when the first query is sent
through a load balancer.

So what can we do instead?

In an initial implementation (cockroachdb#63812), we tried to *approximate*
“exactly once” reporting with a design that combines “at least once”
with a best effort of “at most once”. This was achieved using a
cluster setting. That solution was storing the time of the first query
inside a cluster setting, and relying on the auto-propagation of
setting changes through the cluster to announce to other nodes that
the first query had been reported already.

The use of a cluster setting also takes care of persisting this
information across node restarts, again towards the need to
approximate “at most once”.

After further discussion with stakeholders, it appears that there
is no strong requirement to reach “at most once”. The data processing
pipeline that consumes the "first query" event only needs to trigger
the first time the event is recognized for a given cluster, and can
easily be configured to ignore events reported after that.

The only practical consideration is to minimize the amount of
telemetry data (to ensure scalability over the CockroachCloud fleet),
so we do not want to send a "first query" event for *every single
query executed* either (obviously).

Based on this new understanding, this commit simplifies the approach
taken in cockroachdb#63812: it makes every CockroachDB *process* report the first
SQL query it executes on the TELEMETRY channel. Whether the process
has reported the event already or not is stored in a memory-only
atomic variable.

This approach technically has "at most once" semantics per node (an
event can get lost if the logging output is stalled at the moment the
event is reported), but approximates "at least once" semantics very
well for multi-node clusters, or when nodes restart over time—as,
presumably, the logging system will not be clogged all the time.

There are two open questions in this approach however:

- when a CockroachCloud cluster is initially configured by the CC
  control plane, there are “administrative” queries that
  are *not internal* but also *not user initiated*.

  With the current patch, the "first query" event will be reported
  for these administrative SQL queries. Is this desired?
  (NB: the same question applied to cockroachdb#63812)

  If not, we will need to agree on a reliable way to identify
  a user-initiated query.

  (A similar question exists when the user opens `cockroach sql`, but
  does not type anything in. The `cockroach sql` command sends queries
  in the background, and would trigger the "first query" event even
  without user interaction. Are we OK with this?)

- this approach will cause a "first query" event to be logged *at
  least once per SQL pod* (when multiple pods are running side-by-side),
  and also *once after every pod restart*.

  To retrieve the “time **to** first query” (i.e. not “time **of**
  first query”), the consumer of these events must then compute the
  minimum timestamp of all the first query events received, and
subtract the timestamp of cluster creation.  (The cluster creation
  timestamp can be obtained either via an external source, e.g. when
  the customer initiated the orchestration to create the cluster, or
  by retrieving the first ``node_restart`` event from the OPS logging
  channel.)

  Is this behavior acceptable?

Of note: departure from "at most once" delivery has another advantage
with regards to product telemetry: it teaches us something about the
effective utilization of all the nodes in a cluster. If we know the
cluster has 10 nodes but only 3 report "first query" events, that
might teach us that the cluster is over-provisioned.

[1] in a distributed system, an event can have guaranteed
at-least-once delivery, or at-most-once, but never both. We also have
the same constraint for CDC.

Release note (cli change): CockroachDB now reports a `first_query`
structured event on the TELEMETRY channel for the first SQL query
executed on each server process.
@knz
Contributor Author

knz commented Jun 15, 2021

superseded by #66427

@knz knz closed this Jun 15, 2021
knz added a commit to knz/cockroach that referenced this pull request Jul 20, 2021