release-22.2: changefeedccl: add changefeed.lagging_ranges metric #110970

@jayshrivastava jayshrivastava commented Sep 20, 2023

Backport commits from #109835, #110250


This change is a backport of the commit from #109835. The original change uses
changefeed options to configure lagging ranges metrics. Because changefeed options
require version gates, they cannot be backported. This change instead uses cluster
settings.

This change adds the changefeed.lagging_ranges metric which can be used to track
ranges which are behind. This metric is calculated based on a new cluster setting
changefeed.lagging_ranges_threshold which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the cluster setting changefeed.lagging_ranges_polling_interval which is
the rate at which a rangefeed will poll for lagging ranges and update the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than changefeed.lagging_ranges_threshold. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero changefeed.lagging_ranges until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric changefeed.lagging_ranges is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the metrics_label changefeed
option. A new cluster setting changefeed.lagging_ranges_threshold is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new cluster setting changefeed.lagging_ranges_polling_interval which controls how often
the lagging ranges calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181

Release justification: Customer ask.


@jayshrivastava jayshrivastava changed the title from "changefeedccl: add changefeed.lagging_ranges metric" to "release-22.2: changefeedccl: add changefeed.lagging_ranges metric" Sep 20, 2023
@jayshrivastava jayshrivastava force-pushed the lagging-range-backport-release-22.2 branch from a256bcc to 33f4450 September 20, 2023 16:19
@jayshrivastava jayshrivastava force-pushed the lagging-range-backport-release-22.2 branch from 1e839c6 to 97f7793 September 20, 2023 18:12
@jayshrivastava jayshrivastava marked this pull request as ready for review September 20, 2023 18:12
@jayshrivastava jayshrivastava requested a review from a team September 20, 2023 18:12
@jayshrivastava jayshrivastava requested review from a team as code owners September 20, 2023 18:12
@jayshrivastava jayshrivastava requested review from dhartunian and removed request for a team September 20, 2023 18:12

@miretskiy miretskiy left a comment

Reviewed 2 of 12 files at r1.


-- commits line 3 at r1:
Let's be explicit that this is a backport.
Also, let's describe how this backport differs from the master version.

Finally, prior to giving a 👍 on this backport, let's be triple sure this works fine on the 22.2 branch.
I would like to ask you to run some of the roachtests on this branch and post some screenshots to make
sure these features (and the newly added settings) actually work on this branch.


-- commits line 35 at r1:
We probably should use CRDB-9181


pkg/ccl/changefeedccl/metrics.go line 744 at r1 (raw file):

	// If 4 ranges fall behind, last=7,i=11: X.Dec(7 - 11) = X.Inc(4)
	// If 1 lagging range is deleted, last=11,i=10: X.Dec(11-10) = X.Dec(1)
	var last int64

You need to pick up the changes that made this thread-safe; see master.

samiskin and others added 2 commits September 26, 2023 09:48
This change is a backport of the commit from cockroachdb#109835. The original change uses
changefeed options to configure lagging ranges metrics. Because changefeed options
require version gates, they cannot be backported. This change instead uses cluster
settings.

This change adds the `changefeed.lagging_ranges` metric which can be used to track
ranges which are behind. This metric is calculated based on a new cluster setting
`changefeed.lagging_ranges_threshold` which is the amount of time that a range
checkpoint needs to be in the past to be considered lagging. This defaults to 3 minutes.
This change also adds the cluster setting `changefeed.lagging_ranges_polling_interval` which is
the rate at which a rangefeed will poll for lagging ranges and update the metric.
This defaults to 1 minute.

Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than `changefeed.lagging_ranges_threshold`. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero `changefeed.lagging_ranges` until the initial scan is complete. This
is intentional.

Release note (ops change): A new metric `changefeed.lagging_ranges` is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the `metrics_label` changefeed
option. A new cluster setting `changefeed.lagging_ranges_threshold` is added which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new cluster setting `changefeed.lagging_ranges_polling_interval` which controls how often
the lagging ranges calculation is done. This setting defaults to polling every 1 minute.

Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.

Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of ranges will
decrease.

Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181
Fix a race bug in lagging spans metric.

Fixes: cockroachdb#110235

Release note: None
@jayshrivastava jayshrivastava force-pushed the lagging-range-backport-release-22.2 branch from 97f7793 to 41f05f7 September 26, 2023 13:50
@jayshrivastava

@miretskiy The graph is below with thresholds 2s, 4s, 3.1s from left to right.
image

@jayshrivastava jayshrivastava merged commit ad4e261 into cockroachdb:release-22.2 Sep 26, 2023
2 checks passed
@jayshrivastava jayshrivastava deleted the lagging-range-backport-release-22.2 branch September 27, 2023 03:23
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this pull request Sep 27, 2023
These settings were added to 23.1 and 22.2 in patch versions via cockroachdb#110963 and cockroachdb#110970. These settings
will not exist from 23.2 onwards, so this commit adds them to the retired settings list for 23.2.

Release note: None
Epic: None
craig bot pushed a commit that referenced this pull request Sep 27, 2023
110980: settings: retire changefeed lagging ranges settings r=jayshrivastava a=jayshrivastava

These settings were added to 23.1 and 22.2 in patch versions via #110963 and #110970. These settings will not exist from 23.2 onwards, so this commit adds them to the retired settings list for 23.2.

Release note: None
Epic: None

Co-authored-by: Jayant Shrivastava
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023
These settings were added to 23.1 and 22.2 in patch versions via cockroachdb#110963 and cockroachdb#110970. These settings
will not exist from 23.2 onwards, so this commit adds them to the retired settings list for 23.2.

Release note: None
Epic: None