
changefeedccl: add changefeed.lagging_ranges metric #109835

Conversation

jayshrivastava
Contributor

@jayshrivastava jayshrivastava commented Aug 31, 2023

This change adds the changefeed.lagging_ranges metric which can be used to track
ranges which are behind. This metric is calculated based on a new changefeed option
lagging_ranges_threshold which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the changefeed option lagging_ranges_polling_interval which is
the polling rate at which a rangefeed will poll for lagging ranges and update the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than changefeed.lagging_ranges_threshold. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero changefeed.lagging_ranges until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric changefeed.lagging_ranges is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the metrics_label changefeed
option. A changefeed option lagging_ranges_threshold is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new option lagging_ranges_polling_interval which controls how often the lagging ranges
calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: None

@cockroach-teamcity
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch 6 times, most recently from 2341592 to 6e29018 on September 2, 2023 18:43
@jayshrivastava jayshrivastava changed the title Jayshrivastava/lagging range metrics changefeedccl: add changefeed.lagging_ranges metric Sep 2, 2023
@jayshrivastava jayshrivastava marked this pull request as ready for review September 2, 2023 18:46
@jayshrivastava jayshrivastava requested a review from a team September 2, 2023 18:46
@jayshrivastava jayshrivastava requested a review from a team as a code owner September 2, 2023 18:46
Contributor

@miretskiy miretskiy left a comment


Reviewed 7 of 12 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/metrics.go line 788 at r1 (raw file):

func (s *sliMetrics) getLaggingRangesCallback() func(int64) {
	// Because this gauge is shared between changefeeds in the same metrics scope,
	// must instead modify it using `Inc` and `Dec` (as opposed to `Update`) to

nit: something is missing (e.g. "care must be taken", or "we must instead")


pkg/ccl/changefeedccl/metrics.go line 800 at r1 (raw file):

	// If 3 ranges catch up, last=10,i=7: X.Dec(10 - 7) = X.Dec(3)
	// If 4 ranges fall behind, last=7,i=11: X.Dec(7 - 11) = X.Inc(4)
	// If 1 lagging range is deleted, last=11,i=10: X.Dec(11 - 10) = X.Dec(1)

very nice explanation.
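The delta bookkeeping in that comment can be sketched as a small self-contained Go program. This is a simplified stand-in for the `sliMetrics` callback, not the actual CockroachDB code: `gauge` here is a hypothetical minimal type that mimics a shared metrics gauge supporting only `Inc`/`Dec`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge mimics a metrics gauge shared between changefeeds in the same
// metrics scope; it only supports Inc/Dec, not an absolute Update.
type gauge struct{ v int64 }

func (g *gauge) Inc(d int64) { atomic.AddInt64(&g.v, d) }
func (g *gauge) Dec(d int64) { atomic.AddInt64(&g.v, -d) }

func (g *gauge) Value() int64 { return atomic.LoadInt64(&g.v) }

// laggingRangesCallback returns a closure that applies only the delta between
// the previous count and the new count, so several changefeeds can share one
// gauge without clobbering each other's contributions.
func laggingRangesCallback(g *gauge) func(int64) {
	var last int64
	return func(i int64) {
		g.Dec(last - i) // Dec of a negative delta is an Inc.
		last = i
	}
}

func main() {
	var g gauge
	update := laggingRangesCallback(&g)
	update(10) // 10 ranges lagging:      Dec(0-10)  = Inc(10)
	update(7)  // 3 ranges caught up:     Dec(10-7)  = Dec(3)
	update(11) // 4 ranges fell behind:   Dec(7-11)  = Inc(4)
	fmt.Println(g.Value()) // 11
}
```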


pkg/ccl/changefeedccl/kvfeed/testing_knobs.go line 37 at r1 (raw file):

	ModifyTimestamps func(*hlc.Timestamp)
	// RangefeedOptions which will be passed to the rangefeed which lies under
	// the kvfeed.

nit: can you rephrase "lies" part? just sounds a bit strange. "Which lets kvfeed override regular settings" or some such?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 97 at r1 (raw file):

// LaggingRangesThreshold is how far behind a range must be from the present to
// be considered as 'lagging' behind in metrics
var LaggingRangesThreshold = settings.RegisterDurationSetting(

There are 2 things I'm pretty sure about:

  1. These two settings do not belong here. These thresholds can be passed in via the configuration option you added.
  2. I'm not sure these should be settings in the first place. Perhaps these should be changefeed options with some default values assigned to them?

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 288 at r1 (raw file):

func (ds *DistSender) monitorLaggingRanges(
	ctx context.Context, rr *rangeFeedRegistry, updateLaggingRanges func(int64),
) error {

There are a few things that I think should be changed.

First, I think the responsibility of managing the frequency of this function, plus determination
if/how this function should be called (i.e. use goroutine on a timer, or not) belongs with the caller. I just don't
think we should be adding additional goroutines to rangefeed itself. That can be easily done by changefeed
(kvfeed).

This will allow you to completely remove the settings from distsender (which I mention above).

The second change I would like to see is that, I think ultimately the entire logic of the counting case also belongs with changefeed. Your code to calculate things is fine, of course. It's just this code should not live here. It's too changefeed specific. Instead, I think you can easily use the refactoring you already did (introduction of ForEachPartialRangefeed method), and do something like this:

type ForEachRangeFn func(fn ActiveRangeFeedIterFn) error

// Change WithLaggingRangesUpdate configuration option to be something like:

func WithRangeObserver(observer func(ForEachRangeFn)) RangeFeedOption {
    return optionFunc(func(c *rangeFeedConfig) {
		c.rangeObserver = observer
	})
}

// Then, once you have setup rangefeed registry (rr):
if cfg.rangeObserver != nil {
  cfg.rangeObserver(rr.ForEachPartialRangefeed)
}

That's it -- rangefeed library will invoke a function that will be passed an observer function that the caller
may invoke when it wants to examine the state of all ranges that are part of the rangefeed.

The majority of this function simply moves to changefeed code, which I think is much better.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 306 at r1 (raw file):

				// we consider the time the partial rangefeed was created to be its starting time.
				ts := hlc.Timestamp{WallTime: feed.CreatedTime.UnixNano()}
				if !feed.Resolved.EqOrdering(hlc.Timestamp{}) {

if feed.Resolved.IsSet() ...


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 399 at r1 (raw file):

// ForEachPartialRangefeed invokes provided function for each partial rangefeed. Use manageIterationErrs
// if the fn uses iterutil.StopIteration to stop iteration.

this manageIterationErrs bit is mighty strange. Why wouldn't we want to use stopIteration unconditionally?

Contributor Author

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/metrics.go line 800 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

very nice explanation.

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 97 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

There are 2 things I'm pretty sure about:

  1. These two settings do not belong here. These thresholds can be passed in via the configuration option you added.
  2. I'm not sure these should be settings in the first place. Perhaps these should be changefeed options with some default values assigned to them?

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 288 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

There are a few things that I think should be changed.

First, I think the responsibility of managing the frequency of this function, plus determination
if/how this function should be called (i.e. use goroutine on a timer, or not) belongs with the caller. I just don't
think we should be adding additional goroutines to rangefeed itself. That can be easily done by changefeed
(kvfeed).

This will allow you to completely remove the settings from distsender (which I mention above).

The second change I would like to see is that, I think ultimately the entire logic of the counting case also belongs with changefeed. Your code to calculate things is fine, of course. It's just this code should not live here. It's too changefeed specific. Instead, I think you can easily use the refactoring you already did (introduction of ForEachPartialRangefeed method), and do something like this:

type ForEachRangeFn func(fn ActiveRangeFeedIterFn) error

// Change WithLaggingRangesUpdate configuration option to be something like:

func WithRangeObserver(observer func(ForEachRangeFn)) RangeFeedOption {
    return optionFunc(func(c *rangeFeedConfig) {
		c.rangeObserver = observer
	})
}

// Then, once you have setup rangefeed registry (rr):
if cfg.rangeObserver != nil {
  cfg.rangeObserver(rr.ForEachPartialRangefeed)
}

That's it -- rangefeed library will invoke a function that will be passed an observer function that the caller
may invoke when it wants to examine the state of all ranges that are part of the rangefeed.

The majority of this function simply moves to changefeed code, which I think is much better.

Ack. I agree. I just pushed an update with these changes.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 306 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

if feed.Resolved.IsSet() ...

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 399 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

this manageIterationErrs bit is mighty strange. Why wouldn't we want to use stopIteration unconditionally?

I got rid of it. It was a bit odd because it's called by another iterator, but I think it's okay to not manage iteration errors by default.

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch from 3b592af to be0e279 on September 2, 2023 21:02
Contributor

@miretskiy miretskiy left a comment


Reviewed 11 of 12 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/changefeed_processors.go line 468 at r2 (raw file):

	updateLaggingRanges := sliMetrics.getLaggingRangesCallback()
	return func(fn kvcoord.ForEachRangeFn) {
		go func() {

I think it would be much nicer if kvFeed created this observer instead of processor.

First, kvfeed deals with rangefeeds -- so, it's a natural place for that logic.
Secondly, you don't need to worry about your stray go function not noticing things
like context cancellation.
Finally, you definitely want this observer to run with the same goroutines started
by kvfeed because you want this observer to terminate as soon as the rangefeed itself terminates.


pkg/ccl/changefeedccl/changefeed_processors.go line 478 at r2 (raw file):

				select {
				case <-ctx.Done():
					log.Infof(ctx, "changefeed lagging ranges observer shutting down: %s", ctx.Err())

not sure these (and the above) log lines are needed; if you insist, perhaps gate them behind log.V(1)


pkg/ccl/changefeedccl/changefeed_processors.go line 483 at r2 (raw file):

					count := int64(0)
					thresholdTS := timeutil.Now().Add(-1 * changefeedbase.LaggingRangesThreshold.Get(&settings.SV))
					i := 0

is this used?


pkg/ccl/changefeedccl/changefeedbase/settings.go line 314 at r2 (raw file):

// LaggingRangesCheckFrequency is the frequency at which a changefeed will
// check for ranges which have fallen behind.
var LaggingRangesCheckFrequency = settings.RegisterDurationSetting(

should these be changefeed options instead of settings?
Not much of a change, but options will let you configure more sensitive changefeed with
lower thresholds.


pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 100 at r2 (raw file):

	if cfg.rangeObserver != nil {
		cfg.rangeObserver(rr.ForEachPartialRangefeed)
	}

not sure we need to do this here... Why not call this right after rr is created (in dist_sender_rangefeed)?

Assuming you do this in kvfeed, we have "g" (ctxgroup) that runs rangefeed and schema feed. So, your option
could be

WithRangeFeedObserver(func(observe kvcoord.ForEachRangeFn) {
    g.GoCtx(func(ctx context.Context) error {
        return laggingRangesObserver(....., fn)
    })
})

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 92 at r2 (raw file):

}

// ForEachRangeFn is used to iterate o

nit: needs better comment.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 852 at r2 (raw file):

				for {
					select {
					case <-ctx2.Done():

this is a busy loop; while I understand it's a test, it probably makes it a lot more aggressive.
Just have a simple timer (200ms?)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 863 at r2 (raw file):

					})
					observedRangesMu.Unlock()
					require.NoError(t, err)

A few things: first, it's not great to panic (or use require) on anything other than the test thread itself (it could lead to deadlocks).

Second: not a fan of stray go func. I think under race this may even show up as a leaked goroutine since you're not waiting for the goroutine to finish -- you just cancel context; and if leakchecker runs before that, you'll get an error.

I think you should just use context group.

g := ctxgroup.WithContext(ctx2)
defer func() {
   cancel()
   _ = g.Wait() // Ignore result since it's going to be a cancellation error; you just want to make sure you wait for the goroutine to exit prior to returning
}()

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 888 at r2 (raw file):

				observedRangesMu.Lock()
				defer observedRangesMu.Unlock()
				for r := range expectedRanges {

perhaps use reflect.DeepEqual(a, b)?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 910 at r2 (raw file):

			makeSpan("1/{4-5}"): {},
			makeSpan("{1/5-2}"): {},
		}

nice test

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch 6 times, most recently from d04c370 to b0725c6 on September 5, 2023 13:54
Contributor Author

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/changefeed_processors.go line 468 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I think it would be much nicer if kvFeed created this observer instead of processor.

First, kvfeed deals with rangefeeds -- so, it's a natural place for that logic.
Secondly, you don't need to worry about your stray go function not noticing things
like context cancellation.
Finally, you definitely want this observer to run with the same goroutines started
by kvfeed because you want this observer to terminate as soon as the rangefeed itself terminates.

Done. Now we create it in the same ctx group.


pkg/ccl/changefeedccl/changefeed_processors.go line 478 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

not sure these (and the above) log lines are needed; if you insist, perhaps gate them behind log.V(1)

I got rid of it.


pkg/ccl/changefeedccl/changefeed_processors.go line 483 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

is this used?

Nope. Removed.


pkg/ccl/changefeedccl/changefeedbase/settings.go line 314 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should these be changefeed options instead of settings?
Not much of a change, but options will let you configure more sensitive changefeed with
lower thresholds.

Done.


pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 100 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

not sure we need to do this here... Why not call this right after rr is created (in dist_sender_rangefeed)?

Assuming you do this in kvfeed, we have "g" (ctxgroup) that runs rangefeed and schema feed. So, your option
could be

WithRangeFeedObserver(func(observe kvcoord.ForEachRangeFn) {
    g.GoCtx(func(ctx context.Context) error {
        return laggingRangesObserver(....., fn)
    })
})

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 852 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

this is a busy loop; while I understand it's a test, it probably makes it a lot more aggressive.
Just have a simple timer (200ms?)

Ack. Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 863 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

A few things: first, it's not great to panic (or use require) on anything other than the test thread itself (it could lead to deadlocks).

Second: not a fan of stray go func. I think under race this may even show up as a leaked goroutine since you're not waiting for the goroutine to finish -- you just cancel context; and if leakchecker runs before that, you'll get an error.

I think you should just use context group.

g := ctxgroup.WithContext(ctx2)
defer func() {
   cancel()
   _ = g.Wait() // Ignore result since it's going to be a cancellation error; you just want to make sure you wait for the goroutine to exit prior to returning
}()

Done. I assert that the error is "context cancelled" now. The require happens in a t.Run goroutine which I believe is safe.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 888 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

perhaps use reflect.DeepEqual(a, b)?

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 910 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nice test

Ty!

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch from b0725c6 to 57d7e0a on September 5, 2023 14:05
Contributor Author

@jayshrivastava jayshrivastava left a comment


Also note that ranges undergoing an initial scan for longer than the threshold are considered to be  
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric  
for each range in the table. However, as ranges complete the initial scan, the number of ranges will  
decrease.

@miretskiy Do you think it would be worth having a separate metric which ignores initial scans? I can imagine an operator not wanting this noise.

Alternatively, they could start the changefeed under one metrics label and change the label when it completes, but restarting a big changefeed takes time and there still might be some noise if they try that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)

Contributor Author

@jayshrivastava jayshrivastava left a comment


On second thought, maybe it's not worth it. We have the checkpoint_progress sli metric which ignores initial scans.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)

Contributor

@miretskiy miretskiy left a comment


Reviewed 8 of 10 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 38 at r3 (raw file):

)

type LaggingRangesPollerCfg struct {

comment on the exported struct (or just get rid of the struct).


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 44 at r3 (raw file):

	// LaggingRangesPollingInterval is how often the kv feed will poll for
	// lagging ranges and update lagging ranges metrics.
	LaggingRangesPollingInterval time.Duration

nit: if you keep the struct, it's probably not necessary to prefix member names with laggingrangespolling...


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 126 at r3 (raw file):

		sc, pff, bf, cfg.UseMux, cfg.Targets, cfg.Knobs)
	f.onBackfillCallback = cfg.OnBackfillCallback
	f.rangeObserver = makeLaggingRangesObserver(g, cfg.LaggingRangesPollerCfg)

nit: perhaps rename startLaggingRangesObserver?


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 185 at r3 (raw file):

				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(cfg.LaggingRangesPollingInterval):

I prefer not to use time.After (time.After for example does not notice ctx cancellation until time expires); consider creating timer instead.


pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 96 at r3 (raw file):

	if cfg.knobs.metrics != nil {
		m.metrics = cfg.knobs.metrics
	}

nit: looks like this file only has white space changes -- consider reverting.

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch from 57d7e0a to b5ec7df on September 6, 2023 15:24
@jayshrivastava jayshrivastava added the backport-22.2.x and backport-23.1.x labels Sep 6, 2023
@jayshrivastava
Contributor Author

bors r=miretskiy

@craig
Contributor

craig bot commented Sep 6, 2023

Build succeeded:

@craig craig bot merged commit 4c90f3b into cockroachdb:master Sep 6, 2023
@blathers-crl

blathers-crl bot commented Sep 6, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from b5ec7df to blathers/backport-release-22.2-109835: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from b5ec7df to blathers/backport-release-23.1-109835: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jayshrivastava pushed a commit to jayshrivastava/cockroach that referenced this pull request Sep 26, 2023
This change is a backport of the commit from cockroachdb#109835. The original change uses
changefeed options to configure lagging ranges metrics. Because changefeed options
require version gates, they cannot be backported. This change instead uses cluster
settings.

This change adds the `changefeed.lagging_ranges` metric which can be used to track
ranges which are behind. This metric is calculated based on a new changefeed option
`lagging_ranges_threshold` which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the changefeed option `lagging_ranges_polling_interval` which is
the polling rate at which a rangefeed will poll for lagging ranges and update the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than `changefeed.lagging_ranges_threshold`. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero `changefeed.lagging_ranges` until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric `changefeed.lagging_ranges` is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the `metrics_label` changefeed
option. A new cluster setting `changefeed.lagging_ranges_threshold` is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new cluster setting `changefeed.lagging_ranges_polling_interval` which controls how often
the lagging ranges calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181
jayshrivastava pushed a commit to jayshrivastava/cockroach that referenced this pull request Sep 26, 2023