
release-21.1: Backport changefeed observability PRs. #68106

Merged

Conversation

miretskiy
Contributor

@miretskiy miretskiy commented Jul 27, 2021

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release Justification: A low-danger, high-impact observability change. The added metrics allow us to troubleshoot
changefeed performance when running in large-scale clusters.

@miretskiy miretskiy requested a review from a team as a code owner July 27, 2021 14:15
@blathers-crl

blathers-crl bot commented Jul 27, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user who doesn’t know or care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

@miretskiy miretskiy changed the title from release-21.1: TODO to release-21.1: Backport changefeed observability PRs. Jul 27, 2021
@@ -20,7 +20,7 @@ import (

 // MaxSettings is the maximum number of settings that the system supports.
 // Exported for tests.
-const MaxSettings = 512
+const MaxSettings = 288
Collaborator

Guessing we don't want to decrease this :D

Contributor Author

is this persisted somewhere?

Collaborator

Not persisted, but if we register more settings than the max a panic is thrown. My guess is a few PRs bumped it when we hit the old limit, so we should probably stick with the higher value.

Contributor Author

Fixed. Thanks.

Contributor Author

Removed one of the commits since the setting was increased multiple times.

@miretskiy miretskiy force-pushed the backport21.1-63923-67268 branch from 8cc30ed to 9bf7575 Compare July 27, 2021 15:03
stevendanna and others added 5 commits July 27, 2021 11:18
This adds 3 new histogram metrics to try to get some insight into
whether we are seeing occasionally slow sinks or checkpoints during
various nightly roachtests.

While this produces some data duplication with the previous
flush_nanos, flushes, emitted_messages, and emit_nanos metrics, I
think that having the histogram will still be useful to more easily
see if a changefeed experienced a small number of slow flushes or
checkpoints.

The naming of these metrics is a bit unfortunate, but since
flush_nanos and emit_nanos already existed and since I didn't want to
remove them, I've included 'hist' in the name of these new metrics.

Release note: None
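
For illustration only, here is a minimal sketch of the general shape of such
histogram metrics, written against the standard Prometheus Go client rather
than CockroachDB's internal pkg/util/metric package. The metric name, bucket
boundaries, and the recordFlush helper are assumptions for the sketch, not
the code added by this commit.

    package changefeedmetrics

    import (
        "time"

        "github.com/prometheus/client_golang/prometheus"
    )

    // Hypothetical flush-latency histogram mirroring the idea of the new
    // *_hist_nanos metrics: a handful of slow flushes show up in the upper
    // buckets even when the aggregate flush_nanos counter looks healthy.
    var flushHistNanos = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "changefeed_flush_hist_nanos",
        Help:    "Time spent flushing messages to the sink, in nanoseconds.",
        Buckets: prometheus.ExponentialBuckets(float64(time.Millisecond), 2, 15),
    })

    func init() {
        prometheus.MustRegister(flushHistNanos)
    }

    // recordFlush times a single sink flush and records the observation.
    func recordFlush(flush func() error) error {
        start := time.Now()
        err := flush()
        flushHistNanos.Observe(float64(time.Since(start).Nanoseconds()))
        return err
    }
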
This adds a new cluster setting

    changefeed.slow_span_log_threshold

that allows us to control the threshold for logging slow spans. This
is useful for cases where the auto-calculated threshold is much higher
than we would like.

Release note (sql change): A new cluster setting
`changefeed.slow_span_log_threshold` allows setting a cluster-wide
default for slow span logging.
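
A minimal sketch, assuming the setting holds a duration, of how such a
threshold could override the auto-calculated default when deciding whether to
log a slow span. The helper names and the log call are hypothetical, not the
changefeed's actual code.

    package slowspan

    import (
        "log"
        "time"
    )

    // effectiveThreshold: a non-zero value from the (hypothetically plumbed)
    // changefeed.slow_span_log_threshold setting wins; otherwise fall back to
    // the auto-calculated threshold.
    func effectiveThreshold(settingValue, autoCalculated time.Duration) time.Duration {
        if settingValue > 0 {
            return settingValue
        }
        return autoCalculated
    }

    // maybeLogSlowSpan logs when a span lags behind by more than the threshold.
    func maybeLogSlowSpan(span string, lag, threshold time.Duration) {
        if lag > threshold {
            log.Printf("changefeed span %s is behind by %s (threshold %s)", span, lag, threshold)
        }
    }
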
I occasionally find it useful to know how many observations a given
histogram is based on. The Prometheus output already returns this, but
it is nice to have it in the SQL and JSON output as well.

Release note (ops change): Histogram metrics now store the total
number of observations over time.
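
As a usage sketch only: reading changefeed metrics, including histogram
observation counts, over the SQL interface from Go. The crdb_internal.node_metrics
table exists in CockroachDB, but the exact metric names returned (and how the
histogram count is exposed) are assumptions here, as is the connection string.

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq"
    )

    func main() {
        // Assumes a local, insecure CockroachDB node on the default port.
        db, err := sql.Open("postgres",
            "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // crdb_internal.node_metrics exposes current metric values over SQL;
        // after this change, histogram metrics also report observation counts.
        rows, err := db.Query(
            `SELECT name, value FROM crdb_internal.node_metrics WHERE name LIKE 'changefeed%'`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var name string
            var value float64
            if err := rows.Scan(&name, &value); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s = %v\n", name, value)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }
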
Stop relying on ExportRequestLimit to determine the number of concurrent
export requests, and introduce a dedicated ScanRequestLimit setting.

If the setting is specified, we use that value; otherwise, the default
is computed as 3 * (number of nodes in the cluster), which is the
old behavior, but we cap this number so that concurrency does not get
out of hand when running in a very large cluster.

Fixes cockroachdb#67190

Release note: Provide better configurability of scan request
concurrency. Scan requests are issued by changefeeds during the
backfill.
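
A sketch of the fallback logic described above, with hypothetical helper and
constant names (the actual cap used by the changefeed code may differ): an
explicit ScanRequestLimit setting wins; otherwise use 3 * node count, clamped
so a very large cluster does not issue an unbounded number of concurrent scans.

    package changefeedscan

    // maxDefaultScanConcurrency is an assumed cap for illustration only.
    const maxDefaultScanConcurrency = 100

    // scanRequestConcurrency returns how many scan requests a backfill should
    // issue concurrently: the explicit setting if specified, otherwise
    // 3 * nodeCount (the old behavior), capped.
    func scanRequestConcurrency(settingValue, nodeCount int) int {
        if settingValue > 0 {
            return settingValue
        }
        limit := 3 * nodeCount
        if limit > maxDefaultScanConcurrency {
            limit = maxDefaultScanConcurrency
        }
        return limit
    }
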
Add a metric to keep track of the number of frontier updates in the
changefeed.

Release note: None
@miretskiy miretskiy force-pushed the backport21.1-63923-67268 branch from 9bf7575 to a8bea98 Compare July 27, 2021 15:18
Collaborator

@stevendanna stevendanna left a comment

The metrics changes have been baking for a while and are pretty low risk. The changes to the export request limit seem pretty straightforward and high value for large clusters.
