release-22.2: changefeedccl: add `changefeed.lagging_ranges` metric #110970

jayshrivastava · 2023-09-20T15:37:03Z

Backport commits from #109835, #110250

This change is a backport of the commit from #109835. The original change uses
changefeed options to configure lagging ranges metrics. Because changefeed options
require version gates, they cannot be backported. This change instead uses cluster
settings.

This change adds the changefeed.lagging_ranges metric, which tracks ranges that are
behind. In this backport the metric is calculated using a new cluster setting,
changefeed.lagging_ranges_threshold, which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the cluster setting changefeed.lagging_ranges_polling_interval, which is
the rate at which a rangefeed polls for lagging ranges and updates the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than changefeed.lagging_ranges_threshold. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero changefeed.lagging_ranges until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric changefeed.lagging_ranges is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the metrics_label changefeed
option. A new cluster setting changefeed.lagging_ranges_threshold is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new cluster setting changefeed.lagging_ranges_polling_interval which controls how often
the lagging ranges calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181

Release justification: Customer ask.

miretskiy

Reviewed 2 of 12 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @jayshrivastava)

-- commits line 3 at r1:
Let's be explicit that this is a backport;
Also, let's describe how this backport differs from master version.

Finally, prior to giving a 👍 on this backport, let's be triple sure this works fine on 22.2 branch.
I would like to ask you to run some of the roachtests on this branch, and post some screenshots to make
sure these features (and the newly added settings) actually work on this branch.

-- commits line 35 at r1:
We probably should use CRDB-9181

pkg/ccl/changefeedccl/metrics.go line 744 at r1 (raw file):

	// If 4 ranges fall behind, last=7,i=11: X.Dec(7 - 11) = X.Inc(4)
	// If 1 lagging range is deleted, last=11,i=10: X.Dec(11 - 10) = X.Dec(1)
	var last int64

you need to pick up changes that made this thread safe -- see on master.

jayshrivastava · 2023-09-26T22:04:48Z

@miretskiy The graph is below with thresholds 2s, 4s, 3.1s from left to right.

These settings were added to 23.1 and 22.2 in patch versions via cockroachdb#110963 and cockroachdb#110970. These settings will not exist in 23.2 onwards, so this commit adds them to the retired settings list for 23.2. Release note: None Epic: None

jayshrivastava changed the title ~~changefeedccl: add changefeed.lagging_ranges metric~~ release-22.2: changefeedccl: add changefeed.lagging_ranges metric Sep 20, 2023

jayshrivastava force-pushed the lagging-range-backport-release-22.2 branch from a256bcc to 33f4450 Compare September 20, 2023 16:19

jayshrivastava mentioned this pull request Sep 20, 2023

settings: retire changefeed lagging ranges settings #110980

Merged

jayshrivastava force-pushed the lagging-range-backport-release-22.2 branch from 1e839c6 to 97f7793 Compare September 20, 2023 18:12

jayshrivastava requested a review from miretskiy September 20, 2023 18:12

jayshrivastava marked this pull request as ready for review September 20, 2023 18:12

jayshrivastava requested a review from a team September 20, 2023 18:12

jayshrivastava requested review from a team as code owners September 20, 2023 18:12

jayshrivastava requested review from dhartunian and removed request for a team September 20, 2023 18:12

miretskiy suggested changes Sep 20, 2023

samiskin and others added 2 commits September 26, 2023 09:48

changefeedccl: Fix data race in lagging spans metric

41f05f7

Fix a race bug in lagging spans metric. Fixes: cockroachdb#110235 Release note: None

jayshrivastava force-pushed the lagging-range-backport-release-22.2 branch from 97f7793 to 41f05f7 Compare September 26, 2023 13:50

miretskiy approved these changes Sep 26, 2023

jayshrivastava merged commit ad4e261 into cockroachdb:release-22.2 Sep 26, 2023
2 checks passed

jayshrivastava deleted the lagging-range-backport-release-22.2 branch September 27, 2023 03:23
