release-22.2: changefeedccl: add changefeed.lagging_ranges metric #110970
Reviewed 2 of 12 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @jayshrivastava)
-- commits
line 3 at r1:
Let's be explicit that this is a backport.
Also, let's describe how this backport differs from the master version.
Finally, prior to giving a 👍 on this backport, let's be triple sure this works fine on the 22.2 branch.
I would like to ask you to run some of the roachtests on this branch and post some screenshots to make sure these features (and the newly added settings) actually work on this branch.
-- commits
line 35 at r1:
We probably should use CRDB-9181
pkg/ccl/changefeedccl/metrics.go
line 744 at r1 (raw file):
// If 4 ranges fall behind, last=7,i=11: X.Dec(7 - 11) = X.Inc(4)
// If 1 lagging range is deleted, last=11,i=10: X.Dec(11-10) = X.Dec(1)
var last int64
You need to pick up the changes that made this thread safe; see master.
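The delta-based update described in the quoted comment, together with the thread-safety concern raised here, can be sketched roughly as follows. This is a minimal illustration and not the code from the PR or from master: the `gauge` type and the `laggingUpdater` helper are hypothetical names, and guarding `last` with a mutex is only one plausible way to make the update safe for concurrent callers.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// gauge is a hypothetical stand-in for a metric gauge shared by
// several changefeeds.
type gauge struct{ v int64 }

// Dec subtracts n from the gauge; a negative n increases it.
func (g *gauge) Dec(n int64) { atomic.AddInt64(&g.v, -n) }

// Value returns the current gauge value.
func (g *gauge) Value() int64 { return atomic.LoadInt64(&g.v) }

// laggingUpdater returns a function that reports the current number of
// lagging ranges for one changefeed. Only the difference from the previous
// report is applied, so multiple updaters can share one gauge. The mutex
// (an assumption of this sketch) protects `last` from concurrent calls.
func laggingUpdater(g *gauge) func(current int64) {
	var mu sync.Mutex
	var last int64
	return func(current int64) {
		mu.Lock()
		defer mu.Unlock()
		g.Dec(last - current)
		last = current
	}
}

func main() {
	g := &gauge{}
	update := laggingUpdater(g)
	// 10 ranges behind, 3 catch up, 4 fall behind, 1 lagging range is deleted.
	for _, n := range []int64{10, 7, 11, 10} {
		update(n)
	}
	fmt.Println(g.Value()) // prints 10
}
```

Because each updater only ever applies its own delta, the gauge ends up holding the sum of the latest counts reported by all updaters.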
This change is a backport of the commit from cockroachdb#109835. The original change uses changefeed options to configure the lagging ranges metric. Because changefeed options require version gates, they cannot be backported. This change uses cluster settings instead.
This change adds the `changefeed.lagging_ranges` metric, which tracks ranges that are behind. The metric is calculated from a new cluster setting, `changefeed.lagging_ranges_threshold`, which is the amount of time that a range checkpoint needs to be in the past for the range to be considered lagging. It defaults to 3 minutes. This change also adds the cluster setting `changefeed.lagging_ranges_polling_interval`, which is the interval at which a rangefeed polls for lagging ranges and updates the metric. It defaults to 1 minute.
Sometimes a range may not have any checkpoints for a while because the start time is far in the past (this causes a catchup scan during which no checkpoints are emitted). In this case, the range is considered to be lagging if the created timestamp of the rangefeed is older than `changefeed.lagging_ranges_threshold`. Note that this means any changefeed which starts with an initial scan containing a significant amount of data will likely report a nonzero `changefeed.lagging_ranges` until the initial scan is complete. This is intentional.
Release note (ops change): A new metric `changefeed.lagging_ranges` is added to show the number of ranges which are behind in changefeeds. This metric can be used with the `metrics_label` changefeed option. A new cluster setting `changefeed.lagging_ranges_threshold` is added, which is the amount of time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also a new cluster setting `changefeed.lagging_ranges_polling_interval`, which controls how often the lagging ranges calculation is done. This setting defaults to polling every 1 minute. Note that polling adds latency to the metric being updated. For example, if a range falls behind by 3 minutes, the metric may not update until an additional minute afterwards. Also note that ranges undergoing an initial scan for longer than the threshold are considered to be lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric for each range in the table. However, as ranges complete the initial scan, the number of lagging ranges will decrease.
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181
Fix a race bug in lagging spans metric. Fixes: cockroachdb#110235 Release note: None
@miretskiy The graph is below with thresholds 2s, 4s, 3.1s from left to right.
These settings were added to 23.1 and 22.2 in patch versions via cockroachdb#110963 and cockroachdb#110970. These settings will not exist in 23.2 onwards, so this commit adds them to the retired settings list for 23.2. Release note: None Epic: None
110980: settings: retire changefeed lagging ranges settings r=jayshrivastava a=jayshrivastava
These settings were added to 23.1 and 22.2 in patch versions via #110963 and #110970. These settings will not exist in 23.2 onwards, so this commit adds them to the retired settings list for 23.2.
Release note: None
Epic: None
Co-authored-by: Jayant Shrivastava
Backport commits from #109835, #110250
This change is a backport of the commit from #109835. The original change uses
changefeed options to configure the lagging ranges metric. Because changefeed options
require version gates, they cannot be backported. This change uses cluster
settings instead.
This change adds the `changefeed.lagging_ranges` metric, which tracks ranges that are behind.
The metric is calculated from a new cluster setting, `changefeed.lagging_ranges_threshold`,
which is the amount of time that a range checkpoint needs to be in the past for the range
to be considered lagging. It defaults to 3 minutes.
This change also adds the cluster setting `changefeed.lagging_ranges_polling_interval`,
which is the interval at which a rangefeed polls for lagging ranges and updates the metric.
It defaults to 1 minute.
Sometimes a range may not have any checkpoints for a while because the start time
may be far in the past (this causes a catchup scan during which no checkpoints are emitted).
In this case, the range is considered to be lagging if the created timestamp of the
rangefeed is older than `changefeed.lagging_ranges_threshold`. Note that this means that
any changefeed which starts with an initial scan containing a significant amount of data will
likely indicate nonzero `changefeed.lagging_ranges` until the initial scan is complete. This
is intentional.
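The lagging rule described above can be sketched as a small predicate. This is an illustrative sketch only: `rangeState`, `isLagging`, and `countLagging` are hypothetical names, and it uses wall-clock `time.Time` values where the real changefeed code may track timestamps differently.

```go
package main

import (
	"fmt"
	"time"
)

// rangeState is a hypothetical per-range record used only for this sketch.
type rangeState struct {
	created    time.Time // when the rangefeed on this range was created
	checkpoint time.Time // latest checkpoint; the zero value means none yet
}

// isLagging compares the latest checkpoint against the threshold. While no
// checkpoint has been emitted (for example during a catchup scan), it falls
// back to the rangefeed's created timestamp.
func isLagging(r rangeState, now time.Time, threshold time.Duration) bool {
	ts := r.checkpoint
	if ts.IsZero() {
		ts = r.created
	}
	return now.Sub(ts) > threshold
}

// countLagging returns how many of the given ranges are lagging at time now.
func countLagging(ranges []rangeState, now time.Time, threshold time.Duration) int64 {
	var n int64
	for _, r := range ranges {
		if isLagging(r, now, threshold) {
			n++
		}
	}
	return n
}

func main() {
	now := time.Date(2023, 9, 20, 12, 0, 0, 0, time.UTC)
	threshold := 3 * time.Minute
	ranges := []rangeState{
		{created: now.Add(-time.Hour), checkpoint: now.Add(-time.Minute)},     // caught up
		{created: now.Add(-time.Hour), checkpoint: now.Add(-5 * time.Minute)}, // stale checkpoint
		{created: now.Add(-10 * time.Minute)},                                 // long scan, no checkpoint yet
		{created: now.Add(-30 * time.Second)},                                 // just started
	}
	fmt.Println(countLagging(ranges, now, threshold)) // prints 2
}
```

The third range shows the intentional behavior noted above: a range with no checkpoint yet counts as lagging once its rangefeed is older than the threshold.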
Release note (ops change): A new metric `changefeed.lagging_ranges` is added to show the number of
ranges which are behind in changefeeds. This metric can be used with the `metrics_label` changefeed
option. A new cluster setting `changefeed.lagging_ranges_threshold` is added, which is the amount of
time a range needs to be behind to be considered lagging. By default this is 3 minutes. There is also
a new cluster setting `changefeed.lagging_ranges_polling_interval`, which controls how often
the lagging ranges calculation is done. This setting defaults to polling every 1 minute.
Note that polling adds latency to the metric being updated. For example, if a range falls behind
by 3 minutes, the metric may not update until an additional minute afterwards.
Also note that ranges undergoing an initial scan for longer than the threshold are considered to be
lagging. Starting a changefeed with an initial scan on a large table will likely increment the metric
for each range in the table. However, as ranges complete the initial scan, the number of lagging ranges will
decrease.
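The polling latency mentioned in the release note can be worked through with a short calculation. The sketch below assumes, purely for illustration, that polls fire at exact multiples of the polling interval measured from an arbitrary time zero and that a range is lagging once its age strictly exceeds the threshold. `firstReport` is a hypothetical helper, not part of the changefeed code.

```go
package main

import (
	"fmt"
	"time"
)

// firstReport returns the time of the first poll that observes a range as
// lagging, given when its checkpoint stopped advancing (stalledAt). All
// values are offsets from an arbitrary time zero, and pollInterval must be
// positive.
func firstReport(stalledAt, threshold, pollInterval time.Duration) time.Duration {
	crossing := stalledAt + threshold
	// Index of the first poll tick strictly after the crossing point.
	n := int64(crossing/pollInterval) + 1
	return time.Duration(n) * pollInterval
}

func main() {
	threshold := 3 * time.Minute
	interval := time.Minute
	// Stalled exactly on a tick: the range crosses the threshold at 3m, but
	// the poll at 3m does not yet see it as lagging, so the metric first
	// reflects it at 4m.
	fmt.Println(firstReport(0, threshold, interval)) // prints 4m0s
	// Stalled at 1m30s: the range crosses at 4m30s and is seen at 5m.
	fmt.Println(firstReport(90*time.Second, threshold, interval)) // prints 5m0s
}
```

Under these assumptions the metric reflects a newly lagging range no later than one polling interval after the range crosses the threshold, which matches the "additional minute" in the note for the default settings.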
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-9181
Release justification: Customer ask.