cdc: Stuck changefeed #108040

miretskiy · 2023-08-02T18:08:59Z

As reported by @erikgrinaker , running experiment emitting into null:// sink deadlocked changefeed.
Relevant stacks below:

goroutine 6459467 [select, 108 minutes]:
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).Get(0xc0b481b880, {0x7881e68, 0xc01bf92b40})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:202 +0x165
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed.copyFromSourceToDestUntilTableEvent({0x7881e68, 0xc01bf92b40}, {0x7f1b0ddc1fa8?, 0xc094dcbdb8?}, {0x7f1aff8edf78, 0xc0b481b880}, 0xc0308be640?, {0x785e190?, 0xc04efa3b00?}, {0x17778b07a2024400, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed/kv_feed.go:707 +0x282
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed.(*kvFeed).runUntilTableEvent.func3({0x7881e68, 0xc01bf92b40})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed/kv_feed.go:510 +0x13f
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5


goroutine 6459469 [select, 108 minutes]:
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(0xc04b0309a0, {0x7881e68, 0xc01bf92d40}, {0x7857ba0, 0xc0b481b8a8})
	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:281 +0x75c
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).AcquireMemory.func1(0x3cbcbcc?, 0xc0a0e29380?, {0x7881e68?, 0xc01bf92d40?})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:241 +0xeb
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).AcquireMemory(0xc0b481b880, {0x7881e68?, 0xc01bf92d40?}, 0x1a4)
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:245 +0xd0
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).Add(0xc0b481b880, {0x7881e68, 0xc01bf92d40}, {0xc08dbf1410, 0x2, {0x0, 0x0, 0x0}, {0xc12aaaecadd02bce, 0x2c44208c122, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:268 +0x265
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed.(*rangefeed).addEventsToBuffer(0xc04b030a50, {0x7881e68?, 0xc01bf92d40?})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed/physical_kv_feed.go:109 +0x4cc
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5

Jira issue: CRDB-30296

Epic CRDB-11783

The text was updated successfully, but these errors were encountered:

blathers-crl · 2023-08-02T18:09:02Z

cc @cockroachdb/cdc

erikgrinaker · 2023-08-02T18:13:42Z

Reproduces readily with the following roachtest from #107722:

cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100000/protocol=rangefeed/format=json/sink=null

It splits off 100.000 ranges, ingests 1 billion rows, then creates the following changefeed:

CREATE CHANGEFEED FOR kv.kv INTO 'null://' WITH cursor = <before_inserts>, end_time = <now>, format = 'json', min_checkpoint_frequency = '10s';

The changefeed will stop making progress after a few minutes, with catchup scans blocked sending to the stream, because the stream processors are blocked on quota acquisition as shown above. This happens across all nodes.

104228: dev: replace `stress` with `--runs_per_test` r=rail a=rickystewart `--runs_per_test` does the same thing that `stress` does (runs a test many times to weed out flakes), but better, in that: 1. Better Bazel support -- we had to hack Bazel support into `stress` to keep it working 2. Bazel gives us statistics and control over how the tests are scheduled in a way that is standardized as opposed to the ad-hoc arguments one can pass to `stress` 3. Bazel can collect logs and artifacts for different test runs and expose them to the user in a way that `stress` cannot 4. Drop-in support for remote execution when we have access to it, obviating the need for `roachprod-stress` For now, the default implementation of `dev test --stress` is to run the test 1,000 times, stopping the build if any test fails. This is meant to replicate how people use `stress` (i.e., just start it running and stop if there are any failures). The 1,000 figure is arbitrary and meant to just be a "very high number" of times to run unit tests. The default figure of 1,000 can be adjusted with `--count`. Also update documentation with some new recipes and add extra logging to warn about the change in behavior. Closes #102879 Epic: none Release note: None 107765: sql: remove calls to CreateTestServerParams r=yuzefovich a=yuzefovich This PR removes calls to `tests.CreateTestServerParams` outside of `sql_test` tests as well as does some miscellaneous cleanups along the way. The rationale for doing this is that vast majority of callers didn't actually need the setup performed by the helper but inherited the disabling of test tenant. It seems more prudent for each test to be explicit about the kind of testing knobs and settings it needs. Informs: #76378. Epic: CRDB-18499. Release note: None 107889: concurrency: enable joint claims on the request scan path r=nvanbenschoten a=arulajmani This patch addresses a TODO that was blocking joint claims on a request's scan path and adds a test to show joint claims work. In particular, previously the lock mode of a queued request was hardcoded to use locking strength `lock.Intent`. We now use the correct lock mode by snapshotting an immutable copy of the request's lock strength on to the queuedGuard when inserting a it into a lock's wait queue. Informs #102275 Release note: None 108052: changefeedccl: Release allocation when skipping events r=miretskiy a=miretskiy The changefeed (or KV feed to be precise) may skip some events when "scan boundary" is reached. Scan boundary is a timestamp when certain event occurs -- usually a schema change. But, it may also occur when the `end_time` option is set. The KV feed ignores events that have MVCC timestamp greater or equal to the scan boundary event. Unfortunately, due to a long outstanding bug, the memory allocation associated with the event would not be released when KV feed decides to skip the event. Because of this, allocated memory was "leaked" and not reclaimed. If enough additional events arrive, those leaked events may account for all of the memory budget, thus leading to inability for additional events to be added. This bug impacts any changefeeds running with the `end_time` option set. It might also impact changefeeds that observe normal schema change event, though this situation is highly unlikely(the same transaction that perform schema change had to have modified sufficient number of rows in the table to fill up all of the memory budget). Fixes #108040 Release note (enterprise change): Fix a potential "deadlock" when running changefeed with `end_time` option set. 108066: dev: set `COCKROACH_DEV_LICENSE` for `compose` r=rail a=rickystewart This test won't run without the license key set. Epic: CRDB-17171 Release note: None Co-authored-by: Ricky Stewart <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Arul Ajmani <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]>

The changefeed (or KV feed to be precise) may skip some events when "scan boundary" is reached. Scan boundary is a timestamp when certain event occurs -- usually a schema change. But, it may also occur when the `end_time` option is set. The KV feed ignores events that have MVCC timestamp greater or equal to the scan boundary event. Unfortunately, due to a long outstanding bug, the memory allocation associated with the event would not be released when KV feed decides to skip the event. Because of this, allocated memory was "leaked" and not reclaimed. If enough additional events arrive, those leaked events may account for all of the memory budget, thus leading to inability for additional events to be added. This bug impacts any changefeeds running with the `end_time` option set. It might also impact changefeeds that observe normal schema change event, though this situation is highly unlikely (the same transaction that perform schema change had to have modified sufficient number of rows in the table to fill up all of the memory budget). Fixes #108040 Release note (enterprise change): Fix a potential "deadlock" when running changefeed with `end_time` option set.

The changefeed (or KV feed to be precise) may skip some events when "scan boundary" is reached. Scan boundary is a timestamp when certain event occurs -- usually a schema change. But, it may also occur when the `end_time` option is set. The KV feed ignores events that have MVCC timestamp greater or equal to the scan boundary event. Unfortunately, due to a long outstanding bug, the memory allocation associated with the event would not be released when KV feed decides to skip the event. Because of this, allocated memory was "leaked" and not reclaimed. If enough additional events arrive, those leaked events may account for all of the memory budget, thus leading to inability for additional events to be added. This bug impacts any changefeeds running with the `end_time` option set. It might also impact changefeeds that observe normal schema change event, though this situation is highly unlikely (the same transaction that perform schema change had to have modified sufficient number of rows in the table to fill up all of the memory budget). Fixes cockroachdb#108040 Release note (enterprise change): Fix a potential "deadlock" when running changefeed with `end_time` option set.

miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels Aug 2, 2023

miretskiy changed the title ~~changefeedccl: Stuck changefeed~~ cdc: Stuck changefeed Aug 2, 2023

This was referenced Aug 2, 2023

roachtest: add changefeed scan benchmarks #107722

Merged

rangefeed: backpressure senders on EventChanTimeout=0 #107611

Merged

miretskiy mentioned this issue Aug 2, 2023

changefeedccl: Release allocation when skipping events #108052

Merged

craig bot closed this as completed in 74f3f34 Aug 2, 2023

blathers-crl bot mentioned this issue Aug 2, 2023

release-23.1: changefeedccl: Release allocation when skipping events #108074

Merged

miretskiy mentioned this issue Aug 2, 2023

release-22.2: changefeedccl: Release allocation when skipping events #108076

Merged

miretskiy mentioned this issue Aug 2, 2023

release-22.1: changefeedccl: Release allocation when skipping events #108077

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cdc: Stuck changefeed #108040

cdc: Stuck changefeed #108040

miretskiy commented Aug 2, 2023 •

edited by exalate-issue-sync bot

Loading

blathers-crl bot commented Aug 2, 2023

erikgrinaker commented Aug 2, 2023 •

edited

Loading

cdc: Stuck changefeed #108040

cdc: Stuck changefeed #108040

Comments

miretskiy commented Aug 2, 2023 • edited by exalate-issue-sync bot Loading

blathers-crl bot commented Aug 2, 2023

erikgrinaker commented Aug 2, 2023 • edited Loading

miretskiy commented Aug 2, 2023 •

edited by exalate-issue-sync bot

Loading

erikgrinaker commented Aug 2, 2023 •

edited

Loading