
[WIP] server,sql: make the time-to-first-query observable in telemetry #63812

Closed
wants to merge 1 commit

Conversation

knz
Contributor

@knz knz commented Apr 17, 2021

(inspired by the discussion on #55059 and a request by @vilamurugu)

This change aims to expose “time to first query”, a product
metric that correlates with how easy it is for a user
to start using CockroachDB.

Collecting “time to first query” is complicated
by the following:

  • we cannot compute this entirely on one node using solely in-memory
    operations: we need to find the first time a query is executed across
    a fleet of nodes, and we need to remember the value when nodes get
    restarted.

  • we would like to avoid a read-eval-write cycle with an access to KV
    storage, or disk, or a fan-out across the network, upon every query. The
    collection of this metric should have virtually no performance impact
    on the common case of workloads beyond the first access.

  • the metric “time to first query” is different from “time of first
    query”: we want to track the amount of time between cluster
    creation and the time of first query.

  • we wish to avoid taking into account “internal” SQL queries, such
    as those issued internally by CockroachDB itself and those
    issued by helper/orchestration tools.

In order to reach the stated goal, this commit proposes the following
model:

  • to avoid “internal” SQL queries, the mechanism described below
    only looks at queries issued when the application_name is
    not “internal”, that is, does not start with the special
    prefix "$ ".

  • nodes keep track of their “time of first query” as
    follows:

    • an in-memory only timestamp is maintained and checked upon
      every query execution. Accesses are atomic to avoid
      any performance impact.

    • separately, an asynchronous background process
      (startSynchronizeFirstQueryTimestamp) synchronizes
      the in-memory timestamp with a new cluster setting
      sql.metrics.first_query.timestamp.

The choice is made to use a cluster setting because these
automatically synchronize between nodes and are already
automatically included in telemetry.

  • separately and independently, the cluster now remembers its
    initialization time by storing a timestamp inside a new cluster
    setting sql.metrics.cluster_init.timestamp. This is set only once,
    upon cluster init.

Again, the choice is made to use a cluster setting because these
are already automatically included in telemetry.

Note that this change does not compute the time to first query
itself, but instead exposes two different timestamps in telemetry
(from the cluster settings) which can be subtracted from each other
to reach the stated goal.

Here is an example use, using cockroach demo (nb: by default demo
uses an internal application_name):

```
$ cockroach demo

root> show cluster setting sql.metrics.cluster_init.timestamp;
sql.metrics.cluster_init.timestamp
--------------------------------------
                 1618666751186151817

root> show cluster setting sql.metrics.first_query.timestamp;
  sql.metrics.first_query.timestamp
-------------------------------------
                                  0

root> set application_name = 'foo';

root> show cluster setting sql.metrics.first_query.timestamp;
  sql.metrics.first_query.timestamp
-------------------------------------
                1618666852579067840

root> select 1;

root> -- observe: the first query timestamp does not change!

root> show cluster setting sql.metrics.first_query.timestamp;
  sql.metrics.first_query.timestamp
-------------------------------------
                1618666852579067840
```

Note also that the new first_query.timestamp value is approximate.
This is explained by the following comment in the source code:

```
// The timestamp of the first query that was not issued by
// an "internal" client (as determined by application_name).
//
// This is set to non-zero by any node receiving a query if the
// value is observed to still be zero.
// Meanwhile, as for every cluster setting, it is possible for this
// value to take a while to propagate to the rest of this cluster.
// Therefore, it is possible for multiple nodes to concurrently
// write to the cluster setting, when they receive their first
// client query within a short time frame (the setting propagation
// delay).
// We find this acceptable because this propagation delay
// is short (seconds) whereas the information we hope to extract
// from this timestamp need only be at minutes/hours granularity.
```

See the release note below for a description of impact.

Release note (sql change): CockroachDB now stores the time at which
the SQL logical layer of a cluster was initialized in the cluster
setting sql.metrics.cluster_init.timestamp. It also stores the
approximate time at which the first query from a SQL client is
executed in the new cluster setting
sql.metrics.first_query.timestamp.

This makes it possible to compute the time to first query since
cluster initialization, by subtracting the two values.

Note that the value is approximate because when the initial clients
connect to multiple nodes simultaneously, it is possible for multiple
nodes to concurrently estimate that they received the “first
query”. The uncertainty in this approximation is limited by the
propagation delay of cluster settings, which is typically a dozen
seconds or less across the entire cluster.

@knz knz requested a review from jordanlewis April 17, 2021 13:38
@cockroach-teamcity
Member

This change is Reviewable

@knz knz force-pushed the 20210417-first-query-ts branch 2 times, most recently from a4182db to cfce208 on April 17, 2021 14:01
@knz knz requested a review from vilamurugu April 17, 2021 14:30
@knz
Contributor Author

knz commented Apr 17, 2021

@jordanlewis heads up that I am delaying the implementation of unit testing until we reach some consensus about whether this approach is acceptable.

@knz
Contributor Author

knz commented Apr 17, 2021

See this comment for an alternate approach: #63815 (comment)

craig bot pushed a commit that referenced this pull request Apr 20, 2021
63301: bootstrap: add cockroach/bin to Debian PATH r=rickystewart a=erikgrinaker

This makes it more convenient to use tools like roachprod on GCE
workers.

Release note: None

63656: doctor: add examine and recreate commands r=postamar a=postamar


    doctor: add verbose flag
    
    This commit adds a --verbose flag to the debug doctor CLI command
    which increases output verbosity.
    
    Release note (cli change): added --verbose flag to debug doctor command.


    doctor: fix string comparison bug
    
    Previously, doctor could panic when outputting validation failures with
    error messages under a certain length. This commit fixes this by using
    the proper string prefix comparison function, strings.HasPrefix.
    
    Release note: None


    sql: add checks in unsafe_upsert_descriptor
    
    Previously, it was quite easy to cause a node to crash by passing an
    invalid descriptor argument to the crdb_internal.unsafe_upsert_descriptor
    repair function. This commit makes this function more resilient to
    invalid inputs by changing the descpb.GetDescriptorMetadata function to
    return errors instead of panicking.
    
    Release note: None


    doctor: add examine and recreate commands
    
    The doctor tool can be very helpful for troubleshooting schema
    corruption issues. These investigations often involve loading the
    corrupted descriptor(s) into a local test environment to either better
    understand the root cause or to craft a valid descriptor for a repair.
    In any case, until now this required taking the descriptor hex encodings
    from the debug.zip and updating the system.descriptor table. This can be
    tedious and error-prone. Recent improvements to descriptor validation
    logic have also made it harder to deliberately corrupt descriptors,
    which has unfortunate side-effects in this use case.
    
    This commit changes the doctor CLI commands, adding the 'examine' and
    'recreate' commands. The former did what doctor used to do, the latter
    will output a series of SQL statements. When executed on an empty cluster
    these will populate it with all of the non-system descriptors and
    namespace table entries.
    
    These descriptors are almost as they were, except:
    - the version counter is reset to 1,
    - the modification time is unset,
    - the create-as-of time is unset (tables only).
    Maintaining the original values would violate too many assumptions to be
    safe.
    
    Fixes #62459.
    
    Release note (cli change): The doctor debug tool now has new 'examine'
    and 'recreate' commands. The new 'debug doctor examine' invocation
    replaces the old 'debug doctor' invocation. The new
    'debug doctor recreate' invocation prints SQL statements which will
    recreate the descriptor and namespace state when run in a new, empty
    cluster.


63748: server: remove the setting `server.remote_debugging.mode` r=bdarnell a=knz

Fixes  #57072.
Fixes #26283
Fixes #24992

As suggested/requested by @bdarnell.

Release note (security update): Certain HTTP debug endpoints reserved
to `admin` users now return more details about range start/end keys,
such as the "hot ranges" report.

Release note (security update): The cluster setting
`server.remote_debugging.mode` has been removed. The debug endpoints
are now available to every client with access to the HTTP port.
All the HTTP URLs previously affected by this setting already have
user authentication and require the logged-in user to be a member of
the `admin` group, so there was no need for an additional layer of
security.

63754: roachtest: add restore2TB config with large nodes r=pbardea a=pbardea

This test adds a new configuration to the restore2TB roachtest to test
the vertical scaling capabilities of restore on large nodes.

Release note: None

63814: backupccl: fix CheckEmittedEvents r=adityamaru a=knz

Found while working on #63812.

This ensures the test does not panic if there are more entries in the
log than expected.

Release note: None

63933: roachtest: skip psycopg password_encryption test r=RichardJCai a=rafiss

This is being skipped upstream (psycopg/psycopg2#1263)
but until that is released, we mark it as expected to fail.

fixes #63893

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@knz knz removed request for jordanlewis and vilamurugu June 14, 2021 14:00
knz added a commit to knz/cockroach that referenced this pull request Jun 14, 2021
This change aims to expose “time to first query”, a product metric
that correlates with how easy it is for a user to start using
CockroachDB.

The naive approach to do this would be to store a boolean "has the
first query been reported in telemetry yet" inside a system table.
This naive approach is defective because it incurs a storage
read-eval-write cycle with an access to KV storage, or disk, or a
fan-out across the network, upon every query. This is an unacceptable
performance overhead.

Instead, we want the collection of this metric to have virtually no
performance impact on the common case of workloads beyond the first
query.

To get to a better design, we need to better understand the
requirement.  A naive interpretation of the requirement would be to
assume that we need the time to first query to be reported *exactly
once*.

Besides the fact that this is strictly impossible from a
computer-theoretical perspective [1], an exactly-once approach would
also incur unacceptable performance overhead, due to the need to
synchronize between multiple nodes, when the first query is sent
through a load balancer.

So what can we do instead?

In an initial implementation (cockroachdb#63812), we tried to *approximate*
“exactly once” reporting with a design that combines “at least once”
with a best effort of “at most once”. This was achieved using a
cluster setting. That solution was storing the time of the first query
inside a cluster setting, and relying on the auto-propagation of
setting changes through the cluster to announce to other nodes that
the first query had been reported already.

The use of a cluster setting also takes care of persisting this
information across node restarts, again towards the need to
approximate “at most once”.

After further discussion with stakeholders, it appears that there
is no strong requirement to reach “at most once”. The data processing
pipeline that consumes the "first query" event only needs to trigger
the first time the event is recognized for a given cluster, and can
easily be configured to ignore events reported after that.

The only practical consideration is to minimize the amount of
telemetry data (to ensure scalability over the CockroachCloud fleet),
so we do not want to send a "first query" event for *every single
query executed* either (obviously).

Based on this new understanding, this commit simplifies the approach
taken in cockroachdb#63812: it makes every CockroachDB *process* report the first
SQL query it executes on the TELEMETRY channel. Whether the process
has reported the event already or not is stored in a memory-only
atomic variable.

This approach technically has "at most once" semantics per node (an
event can get lost if the logging output is stalled at the moment the
event is reported), but approximates "at least once" semantics very
well for multi-node clusters, or when nodes restart over time—as,
presumably, the logging system will not be clogged all the time.

There are two open questions in this approach however:

- when a CockroachCloud cluster is initially configured by the CC
  control plane, there are “administrative” queries that
  are *not internal* but also *not user initiated*.

  With the current patch, the "first query" event will be reported
  for these administrative SQL queries. Is this desired?
  (NB: the same question applied to cockroachdb#63812)

  If not, we will need to agree on a reliable way to identify
  a user-initiated query.

  (A similar question exists when the user opens `cockroach sql`, but
  does not type anything in. The `cockroach sql` command sends queries
  in the background, and would trigger the "first query" event even
  without user interaction. Are we OK with this?)

- this approach will cause a "first query" event to be logged *at
  least once per SQL pod* (when multiple pods are running side-by-side),
  and also *once after every pod restart*.

  To retrieve the “time **to** first query” (i.e. not “time **of**
  first query”), the consumer of these events must then compute the
  minimum timestamp of all the first query events received, and
subtract the timestamp of cluster creation.  (The cluster creation
  timestamp can be obtained either via an external source, e.g. when
  the customer initiated the orchestration to create the cluster, or
  by retrieving the first ``node_restart`` event from the OPS logging
  channel.)

  Is this behavior acceptable?

Of note: departure from "at most once" delivery has another advantage
with regards to product telemetry: it teaches us something about the
effective utilization of all the nodes in a cluster. If we know the
cluster has 10 nodes but only 3 report "first query" events, that
might teach us that the cluster is over-provisioned.

[1] in a distributed system, an event can have guaranteed
at-least-once delivery, or at-most-once, but never both. We also have
the same constraint for CDC.

Release note (cli change): CockroachDB now reports a `first_query`
structured event on the TELEMETRY channel for the first SQL query
executed on each server process.
@knz
Contributor Author

knz commented Jun 15, 2021

superseded by #66427

@knz knz closed this Jun 15, 2021
knz added a commit to knz/cockroach that referenced this pull request Jul 20, 2021