
changefeedccl: add changefeed.lagging_ranges metric #109835

Conversation

jayshrivastava
Contributor

@jayshrivastava jayshrivastava commented Aug 31, 2023

This change adds the changefeed.lagging_ranges metric which can be used to track
ranges which are behind. This metric is calculated based on a new changefeed option
lagging_ranges_threshold which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the changefeed option lagging_ranges_polling_interval which is
the polling rate at which a rangefeed will poll for lagging ranges and update the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than changefeed.lagging_ranges_threshold. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero changefeed.lagging_ranges until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric changefeed.lagging_ranges is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the metrics_label changefeed
option. A changefeed option lagging_ranges_threshold is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new option lagging_ranges_polling_interval which controls how often the lagging ranges
calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: None

@cockroach-teamcity
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch 6 times, most recently from 2341592 to 6e29018 on September 2, 2023 18:43
@jayshrivastava jayshrivastava changed the title Jayshrivastava/lagging range metrics changefeedccl: add changefeed.lagging_ranges metric Sep 2, 2023
@jayshrivastava jayshrivastava marked this pull request as ready for review September 2, 2023 18:46
@jayshrivastava jayshrivastava requested a review from a team September 2, 2023 18:46
@jayshrivastava jayshrivastava requested a review from a team as a code owner September 2, 2023 18:46
Contributor

@miretskiy miretskiy left a comment


Reviewed 7 of 12 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/metrics.go line 788 at r1 (raw file):

func (s *sliMetrics) getLaggingRangesCallback() func(int64) {
	// Because this gauge is shared between changefeeds in the same metrics scope,
	// must instead modify it using `Inc` and `Dec` (as opposed to `Update`) to

nit: something is missing (e.g. "care must be taken", or "we must instead")


pkg/ccl/changefeedccl/metrics.go line 800 at r1 (raw file):

	// If 3 ranges catch up, last=10,i=7: X.Dec(10 - 7) = X.Dec(3)
	// If 4 ranges fall behind, last=7,i=11: X.Dec(7 - 11) = X.Inc(4)
	// If 1 lagging range is deleted, last=11,i=10: X.Dec(11 - 10) = X.Dec(1)

very nice explanation.
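The delta bookkeeping in that comment can be sketched as a small self-contained Go program. This is a simplified stand-in for the `sliMetrics` callback, not the actual CockroachDB code: `gauge` here is a hypothetical minimal type that mimics a shared metrics gauge supporting only `Inc`/`Dec`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gauge mimics a metrics gauge shared between changefeeds in the same
// metrics scope; it only supports Inc/Dec, not an absolute Update.
type gauge struct{ v int64 }

func (g *gauge) Inc(d int64) { atomic.AddInt64(&g.v, d) }
func (g *gauge) Dec(d int64) { atomic.AddInt64(&g.v, -d) }

func (g *gauge) Value() int64 { return atomic.LoadInt64(&g.v) }

// laggingRangesCallback returns a closure that applies only the delta between
// the previous count and the new count, so several changefeeds can share one
// gauge without clobbering each other's contributions.
func laggingRangesCallback(g *gauge) func(int64) {
	var last int64
	return func(i int64) {
		g.Dec(last - i) // Dec of a negative delta is an Inc.
		last = i
	}
}

func main() {
	var g gauge
	update := laggingRangesCallback(&g)
	update(10) // 10 ranges lagging:      Dec(0-10)  = Inc(10)
	update(7)  // 3 ranges caught up:     Dec(10-7)  = Dec(3)
	update(11) // 4 ranges fell behind:   Dec(7-11)  = Inc(4)
	fmt.Println(g.Value()) // 11
}
```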


pkg/ccl/changefeedccl/kvfeed/testing_knobs.go line 37 at r1 (raw file):

	ModifyTimestamps func(*hlc.Timestamp)
	// RangefeedOptions which will be passed to the rangefeed which lies under
	// the kvfeed.

nit: can you rephrase "lies" part? just sounds a bit strange. "Which lets kvfeed override regular settings" or some such?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 97 at r1 (raw file):

// LaggingRangesThreshold is how far behind a range must be from the present to
// be considered as 'lagging' behind in metrics
var LaggingRangesThreshold = settings.RegisterDurationSetting(

There are 2 things I'm pretty sure about:

  1. These two settings do not belong here. These thresholds can be passed in via the configuration option you added.
  2. I'm not sure these should be settings in the first place. Perhaps these should be changefeed options with some default values assigned to them?

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 288 at r1 (raw file):

func (ds *DistSender) monitorLaggingRanges(
	ctx context.Context, rr *rangeFeedRegistry, updateLaggingRanges func(int64),
) error {

There are a few things that I think should be changed.

First, I think the responsibility of managing the frequency of this function, plus determination
if/how this function should be called (i.e. use goroutine on a timer, or not) belongs with the caller. I just don't
think we should be adding additional goroutines to rangefeed itself. That can be easily done by changefeed
(kvfeed).

This will allow you to completely remove the settings from distsender (which I mention above).

The second change I would like to see is that, I think ultimately the entire logic of the counting case also belongs with changefeed. Your code to calculate things is fine, of course. It's just this code should not live here. It's too changefeed specific. Instead, I think you can easily use the refactoring you already did (introduction of ForEachPartialRangefeed method), and do something like this:

type ForEachRangeFn func(fn ActiveRangeFeedIterFn) error

// Change WithLaggingRangesUpdate configuration option to be something like:

func WithRangeObserver(observer func(ForEachRangeFn)) RangeFeedOption {
    return optionFunc(func(c *rangeFeedConfig) {
		c.rangeObserver = observer
	})
}

// Then, once you have setup rangefeed registry (rr):
if cfg.rangeObserver != nil {
  cfg.rangeObserver(rr.ForEachPartialRangefeed)
}

That's it -- rangefeed library will invoke a function that will be passed an observer function that the caller
may invoke when it wants to examine the state of all ranges that are part of the rangefeed.

The majority of this function simply moves to changefeed code, which I think is much better.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 306 at r1 (raw file):

				// we consider the time the partial rangefeed was created to be its starting time.
				ts := hlc.Timestamp{WallTime: feed.CreatedTime.UnixNano()}
				if !feed.Resolved.EqOrdering(hlc.Timestamp{}) {

if feed.Resolved.IsSet() ...


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 399 at r1 (raw file):

// ForEachPartialRangefeed invokes provided function for each partial rangefeed. Use manageIterationErrs
// if the fn uses iterutil.StopIteration to stop iteration.

this manageIterationErrs bit is mighty strange. Why wouldn't we want to use stopIteration unconditionally?

Contributor Author

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/metrics.go line 800 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

very nice explanation.

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 97 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

There are 2 things I'm pretty sure about:

  1. These two settings do not belong here. These thresholds can be passed in via the configuration option you added.
  2. I'm not sure these should be settings in the first place. Perhaps these should be changefeed options with some default values assigned to them?

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 288 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

There are a few things that I think should be changed.

First, I think the responsibility of managing the frequency of this function, plus determination
if/how this function should be called (i.e. use goroutine on a timer, or not) belongs with the caller. I just don't
think we should be adding additional goroutines to rangefeed itself. That can be easily done by changefeed
(kvfeed).

This will allow you to completely remove the settings from distsender (which I mention above).

The second change I would like to see is that, I think ultimately the entire logic of the counting case also belongs with changefeed. Your code to calculate things is fine, of course. It's just this code should not live here. It's too changefeed specific. Instead, I think you can easily use the refactoring you already did (introduction of ForEachPartialRangefeed method), and do something like this:

type ForEachRangeFn func(fn ActiveRangeFeedIterFn) error

// Change WithLaggingRangesUpdate configuration option to be something like:

func WithRangeObserver(observer func(ForEachRangeFn)) RangeFeedOption {
    return optionFunc(func(c *rangeFeedConfig) {
		c.rangeObserver = observer
	})
}

// Then, once you have setup rangefeed registry (rr):
if cfg.rangeObserver != nil {
  cfg.rangeObserver(rr.ForEachPartialRangefeed)
}

That's it -- rangefeed library will invoke a function that will be passed an observer function that the caller
may invoke when it wants to examine the state of all ranges that are part of the rangefeed.

The majority of this function simply moves to changefeed code, which I think is much better.

Ack. I agree. I just pushed an update with these changes.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 306 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

if feed.Resolved.IsSet() ...

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 399 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

this manageIterationErrs bit is mighty strange. Why wouldn't we want to use stopIteration unconditionally?

I got rid of it. It was a bit odd because it's called by another iterator, but I think it's okay to not manage iteration errors by default.

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch from 3b592af to be0e279 on September 2, 2023 21:02
Contributor

@miretskiy miretskiy left a comment


Reviewed 11 of 12 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/changefeed_processors.go line 468 at r2 (raw file):

	updateLaggingRanges := sliMetrics.getLaggingRangesCallback()
	return func(fn kvcoord.ForEachRangeFn) {
		go func() {

I think it would be much nicer if kvFeed created this observer instead of processor.

First, kvfeed deals with rangefeeds -- so, it's a natural place for that logic.
Secondly, you don't need to worry about your stray go function not noticing things
like context cancellation.
Finally, you definitely want this observer to run with the same goroutines started
by kvfeed because you want this observer to terminate as soon as the rangefeed itself terminates.


pkg/ccl/changefeedccl/changefeed_processors.go line 478 at r2 (raw file):

				select {
				case <-ctx.Done():
					log.Infof(ctx, "changefeed lagging ranges observer shutting down: %s", ctx.Err())

not sure these (and the above) log lines are needed; if you insist, perhaps gate them behind log.V(1)


pkg/ccl/changefeedccl/changefeed_processors.go line 483 at r2 (raw file):

					count := int64(0)
					thresholdTS := timeutil.Now().Add(-1 * changefeedbase.LaggingRangesThreshold.Get(&settings.SV))
					i := 0

is this used?


pkg/ccl/changefeedccl/changefeedbase/settings.go line 314 at r2 (raw file):

// LaggingRangesCheckFrequency is the frequency at which a changefeed will
// check for ranges which have fallen behind.
var LaggingRangesCheckFrequency = settings.RegisterDurationSetting(

should these be changefeed options instead of settings?
Not much of a change, but options will let you configure more sensitive changefeed with
lower thresholds.


pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 100 at r2 (raw file):

	if cfg.rangeObserver != nil {
		cfg.rangeObserver(rr.ForEachPartialRangefeed)
	}

not sure we need to do this here... Why not call this right after rr is created (in dist_sender_rangefeed)?

Assuming you do this in kvfeed, we have "g" (ctxgroup) that runs rangefeed and schema feed. So, your option
could be

WithRangeFeedObserver(func(observe kvcoord.ForEachRangeFn) {
    g.GoCtx(func(ctx context.Context) error {
        return laggingRangesObserver(....., fn)
    })
})

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 92 at r2 (raw file):

}

// ForEachRangeFn is used to iterate o

nit: needs better comment.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 852 at r2 (raw file):

				for {
					select {
					case <-ctx2.Done():

this is a busy loop; while I understand it's a test, it probably makes it a lot more aggressive.
Just have a simple timer (200ms?)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 863 at r2 (raw file):

					})
					observedRangesMu.Unlock()
					require.NoError(t, err)

A few things: first, it's not great to panic (or use require) on anything other than the test thread itself (it could lead to deadlocks).

Second: not a fan of stray go func. I think under race this may even show up as a leaked goroutine since you're not waiting for the goroutine to finish -- you just cancel context; and if leakchecker runs before that, you'll get an error.

I think you should just use context group.

g := ctxgroup.WithContext(ctx2)
defer func() {
   cancel()
   _ = g.Wait() // Ignore result since it's going to be a cancellation error; you just want to make sure you wait for the goroutine to exit prior to returning
}()

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 888 at r2 (raw file):

				observedRangesMu.Lock()
				defer observedRangesMu.Unlock()
				for r := range expectedRanges {

perhaps use reflect.DeepEqual(a, b)?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 910 at r2 (raw file):

			makeSpan("1/{4-5}"): {},
			makeSpan("{1/5-2}"): {},
		}

nice test

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch 6 times, most recently from d04c370 to b0725c6 on September 5, 2023 13:54
Contributor Author

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/changefeed_processors.go line 468 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I think it would be much nicer if kvFeed created this observer instead of processor.

First, kvfeed deals with rangefeeds -- so, it's a natural place for that logic.
Secondly, you don't need to worry about your stray go function not noticing things
like context cancellation.
Finally, you definitely want this observer to run with the same goroutines started
by kvfeed because you want this observer to terminate as soon as the rangefeed itself terminates.

Done. Now we create it in the same ctx group.


pkg/ccl/changefeedccl/changefeed_processors.go line 478 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

not sure these (and the above) log lines are needed; if you insist, perhaps gate them behind log.V(1)

I got rid of it.


pkg/ccl/changefeedccl/changefeed_processors.go line 483 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

is this used?

Nope. Removed.


pkg/ccl/changefeedccl/changefeedbase/settings.go line 314 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should these be changefeed options instead of settings?
Not much of a change, but options will let you configure more sensitive changefeed with
lower thresholds.

Done.


pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 100 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

not sure we need to do this here... Why not call this right after rr is created (in dist_sender_rangefeed)?

Assuming you do this in kvfeed, we have "g" (ctxgroup) that runs rangefeed and schema feed. So, your option
could be

WithRangeFeedObserver(func(observe kvcoord.ForEachRangeFn) {
    g.GoCtx(func(ctx context.Context) error {
        return laggingRangesObserver(....., fn)
    })
})

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 852 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

this is a busy loop; while I understand it's a test, it probably makes it a lot more aggressive.
Just have a simple timer (200ms?)

Ack. Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 863 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

A few things: first, it's not great to panic (or use require) on anything other than the test thread itself (it could lead to deadlocks).

Second: not a fan of stray go func. I think under race this may even show up as a leaked goroutine since you're not waiting for the goroutine to finish -- you just cancel context; and if leakchecker runs before that, you'll get an error.

I think you should just use context group.

g := ctxgroup.WithContext(ctx2)
defer func() {
   cancel()
   _ = g.Wait() // Ignore result since it's going to be a cancellation error; you just want to make sure you wait for the goroutine to exit prior to returning
}()

Done. I assert that the error is "context cancelled" now. The require happens in a t.Run goroutine which I believe is safe.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 888 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

perhaps use reflect.DeepEqual(a, b)?

Done.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 910 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nice test

Ty!

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch from b0725c6 to 57d7e0a on September 5, 2023 14:05
Contributor Author

@jayshrivastava jayshrivastava left a comment


Also note that ranges undergoing an initial scan for longer than the threshold are considered to be  
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric  
for each range in the table. However, as ranges complete the initial scan, the number of ranges will  
decrease.

@miretskiy Do you think it would be worth having a separate metric which ignores initial scans? I can imagine an operator not wanting this noise.

Alternatively, they could start the changefeed under one metrics label and change the label when it completes, but restarting a big changefeed takes time and there still might be some noise if they try that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)

Contributor Author

@jayshrivastava jayshrivastava left a comment


On second thought, maybe it's not worth it. We have the checkpoint_progress sli metric which ignores initial scans.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)

Contributor

@miretskiy miretskiy left a comment


Reviewed 8 of 10 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava)


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 38 at r3 (raw file):

)

type LaggingRangesPollerCfg struct {

comment on the exported struct (or just get rid of the struct).


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 44 at r3 (raw file):

	// LaggingRangesPollingInterval is how often the kv feed will poll for
	// lagging ranges and update lagging ranges metrics.
	LaggingRangesPollingInterval time.Duration

nit: if you keep the struct, it's probably not necessary to prefix member names with laggingrangespolling...


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 126 at r3 (raw file):

		sc, pff, bf, cfg.UseMux, cfg.Targets, cfg.Knobs)
	f.onBackfillCallback = cfg.OnBackfillCallback
	f.rangeObserver = makeLaggingRangesObserver(g, cfg.LaggingRangesPollerCfg)

nit: perhaps rename startLaggingRangesObserver?


pkg/ccl/changefeedccl/kvfeed/kv_feed.go line 185 at r3 (raw file):

				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(cfg.LaggingRangesPollingInterval):

I prefer not to use time.After (time.After for example does not notice ctx cancellation until time expires); consider creating timer instead.


pkg/kv/kvclient/kvcoord/dist_sender_mux_rangefeed.go line 96 at r3 (raw file):

	if cfg.knobs.metrics != nil {
		m.metrics = cfg.knobs.metrics
	}

nit: looks like this file only has white space changes -- consider reverting.

@jayshrivastava jayshrivastava force-pushed the jayshrivastava/lagging-range-metrics branch from 57d7e0a to b5ec7df on September 6, 2023 15:24
@jayshrivastava jayshrivastava added the backport-22.2.x and backport-23.1.x labels Sep 6, 2023
@jayshrivastava
Contributor Author

bors r=miretskiy

@craig
Contributor

craig bot commented Sep 6, 2023

Build succeeded:

@craig craig bot merged commit 4c90f3b into cockroachdb:master Sep 6, 2023
@blathers-crl

blathers-crl bot commented Sep 6, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from b5ec7df to blathers/backport-release-22.2-109835: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


error creating merge commit from b5ec7df to blathers/backport-release-23.1-109835: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jayshrivastava pushed a commit to jayshrivastava/cockroach that referenced this pull request Sep 26, 2023
This change is a backport of the commit from cockroachdb#109835. The original change uses
changefeed options to configure lagging ranges metrics. Because changefeed options
require version gates, they cannot be backported. This change instead uses cluster
settings.

This change adds the `changefeed.lagging_ranges` metric which can be used to track
ranges which are behind. This metric is calculated based on a new changefeed option
`lagging_ranges_threshold` which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the changefeed option `lagging_ranges_polling_interval` which is
the polling rate at which a rangefeed will poll for lagging ranges and update the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than `changefeed.lagging_ranges_threshold`. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero `changefeed.lagging_ranges` until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric `changefeed.lagging_ranges` is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the `metrics_label` changefeed
option. A new cluster setting `changefeed.lagging_ranges_threshold` is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new cluster setting `changefeed.lagging_ranges_polling_interval` which controls how often
the lagging ranges calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181
jayshrivastava pushed a commit to jayshrivastava/cockroach that referenced this pull request Sep 26, 2023