Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cdc: Stuck changefeed #108040

Closed
miretskiy opened this issue Aug 2, 2023 · 2 comments · Fixed by #108052
Closed

cdc: Stuck changefeed #108040

miretskiy opened this issue Aug 2, 2023 · 2 comments · Fixed by #108052
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc

Comments

@miretskiy
Copy link
Contributor

miretskiy commented Aug 2, 2023

As reported by @erikgrinaker , running experiment emitting into null:// sink deadlocked changefeed.
Relevant stacks below:

goroutine 6459467 [select, 108 minutes]:
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).Get(0xc0b481b880, {0x7881e68, 0xc01bf92b40})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:202 +0x165
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed.copyFromSourceToDestUntilTableEvent({0x7881e68, 0xc01bf92b40}, {0x7f1b0ddc1fa8?, 0xc094dcbdb8?}, {0x7f1aff8edf78, 0xc0b481b880}, 0xc0308be640?, {0x785e190?, 0xc04efa3b00?}, {0x17778b07a2024400, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed/kv_feed.go:707 +0x282
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed.(*kvFeed).runUntilTableEvent.func3({0x7881e68, 0xc01bf92b40})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed/kv_feed.go:510 +0x13f
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5


goroutine 6459469 [select, 108 minutes]:
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(0xc04b0309a0, {0x7881e68, 0xc01bf92d40}, {0x7857ba0, 0xc0b481b8a8})
	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:281 +0x75c
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).AcquireMemory.func1(0x3cbcbcc?, 0xc0a0e29380?, {0x7881e68?, 0xc01bf92d40?})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:241 +0xeb
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).AcquireMemory(0xc0b481b880, {0x7881e68?, 0xc01bf92d40?}, 0x1a4)
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:245 +0xd0
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent.(*blockingBuffer).Add(0xc0b481b880, {0x7881e68, 0xc01bf92d40}, {0xc08dbf1410, 0x2, {0x0, 0x0, 0x0}, {0xc12aaaecadd02bce, 0x2c44208c122, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvevent/blocking_buffer.go:268 +0x265
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed.(*rangefeed).addEventsToBuffer(0xc04b030a50, {0x7881e68?, 0xc01bf92d40?})
	github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/kvfeed/physical_kv_feed.go:109 +0x4cc
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1()
	github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:168 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:72 +0xa5

Jira issue: CRDB-30296

Epic CRDB-11783

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels Aug 2, 2023
@blathers-crl
Copy link

blathers-crl bot commented Aug 2, 2023

cc @cockroachdb/cdc

@miretskiy miretskiy changed the title changefeedccl: Stuck changefeed cdc: Stuck changefeed Aug 2, 2023
@erikgrinaker
Copy link
Contributor

erikgrinaker commented Aug 2, 2023

Reproduces readily with the following roachtest from #107722:

cdc/scan/catchup/nodes=5/cpu=16/rows=1G/ranges=100000/protocol=rangefeed/format=json/sink=null

It splits off 100.000 ranges, ingests 1 billion rows, then creates the following changefeed:

CREATE CHANGEFEED FOR kv.kv INTO 'null://' WITH cursor = <before_inserts>, end_time = <now>, format = 'json', min_checkpoint_frequency = '10s';

The changefeed will stop making progress after a few minutes, with catchup scans blocked sending to the stream, because the stream processors are blocked on quota acquisition as shown above. This happens across all nodes.

craig bot pushed a commit that referenced this issue Aug 2, 2023
104228: dev: replace `stress` with `--runs_per_test` r=rail a=rickystewart

`--runs_per_test` does the same thing that `stress` does (runs a test
many times to weed out flakes), but better, in that:

1. Better Bazel support -- we had to hack Bazel support into `stress`
   to keep it working
2. Bazel gives us statistics and control over how the tests are
   scheduled in a way that is standardized as opposed to the ad-hoc
   arguments one can pass to `stress`
3. Bazel can collect logs and artifacts for different test runs and
   expose them to the user in a way that `stress` cannot
4. Drop-in support for remote execution when we have access to it,
   obviating the need for `roachprod-stress`

For now, the default implementation of `dev test --stress` is to
run the test 1,000 times, stopping the build if any test fails.
This is meant to replicate how people use `stress` (i.e., just start it
running and stop if there are any failures). The 1,000 figure is
arbitrary and meant to just be a "very high number" of times to run unit
tests. The default figure of 1,000 can be adjusted with `--count`. Also
update documentation with some new recipes and add extra logging to
warn about the change in behavior.

Closes #102879

Epic: none
Release note: None

107765: sql: remove calls to CreateTestServerParams r=yuzefovich a=yuzefovich

This PR removes calls to `tests.CreateTestServerParams` outside of `sql_test` tests as well as does some miscellaneous cleanups along the way. The rationale for doing this is that vast majority of callers didn't actually need the setup performed by the helper but inherited the disabling of test tenant. It seems more prudent for each test to be explicit about the kind of testing knobs and settings it needs.

Informs: #76378.
Epic: CRDB-18499.

Release note: None

107889: concurrency: enable joint claims on the request scan path r=nvanbenschoten a=arulajmani

This patch addresses a TODO that was blocking joint claims on a request's scan path and adds a test to show joint claims work. In particular, previously the lock mode of a queued request was hardcoded to use locking strength `lock.Intent`. We now use the correct lock mode by snapshotting an immutable copy of the request's lock strength on to the queuedGuard when inserting a it into a lock's wait queue.

Informs #102275

Release note: None

108052: changefeedccl: Release allocation when skipping events r=miretskiy a=miretskiy

The changefeed (or KV feed to be precise) may skip some events when "scan boundary" is reached.
Scan boundary is a timestamp when certain event occurs -- usually a schema change.  But, it may also occur
when the `end_time` option is set.

The KV feed ignores events that have MVCC timestamp greater or equal to the scan boundary event.

Unfortunately, due to a long outstanding bug, the memory allocation associated with the event would not be released when KV feed decides to skip the event.

Because of this, allocated memory was "leaked" and not reclaimed. If enough additional events arrive, those leaked events may account for all of the memory budget, thus leading to inability for additional events to be added.

This bug impacts any changefeeds running with the `end_time` option set.  It might also impact changefeeds that observe normal schema change event, though this situation is highly unlikely(the same transaction that perform schema change had to have modified sufficient number of rows in the table to fill up all of the memory budget).

Fixes #108040

Release note (enterprise change): Fix a potential "deadlock" when running changefeed with `end_time` option set.

108066: dev: set `COCKROACH_DEV_LICENSE` for `compose` r=rail a=rickystewart

This test won't run without the license key set.

Epic: CRDB-17171
Release note: None

Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in 74f3f34 Aug 2, 2023
blathers-crl bot pushed a commit that referenced this issue Aug 2, 2023
The changefeed (or KV feed to be precise) may skip
some events when "scan boundary" is reached.
Scan boundary is a timestamp when certain event occurs --
usually a schema change.  But, it may also occur
when the `end_time` option is set.

The KV feed ignores events that have MVCC timestamp
greater or equal to the scan boundary event.

Unfortunately, due to a long outstanding bug, the memory
allocation associated with the event would not be released
when KV feed decides to skip the event.

Because of this, allocated memory was "leaked" and not reclaimed.
If enough additional events arrive, those leaked events may
account for all of the memory budget, thus leading to inability
for additional events to be added.

This bug impacts any changefeeds running with the `end_time`
option set.  It might also impact changefeeds that observe
normal schema change event, though this situation is highly unlikely
(the same transaction that perform schema change had to have
modified sufficient number of rows in the table to fill
up all of the memory budget).

Fixes #108040

Release note (enterprise change): Fix a potential "deadlock" when
running changefeed with `end_time` option set.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 2, 2023
The changefeed (or KV feed to be precise) may skip
some events when "scan boundary" is reached.
Scan boundary is a timestamp when certain event occurs --
usually a schema change.  But, it may also occur
when the `end_time` option is set.

The KV feed ignores events that have MVCC timestamp
greater or equal to the scan boundary event.

Unfortunately, due to a long outstanding bug, the memory
allocation associated with the event would not be released
when KV feed decides to skip the event.

Because of this, allocated memory was "leaked" and not reclaimed.
If enough additional events arrive, those leaked events may
account for all of the memory budget, thus leading to inability
for additional events to be added.

This bug impacts any changefeeds running with the `end_time`
option set.  It might also impact changefeeds that observe
normal schema change event, though this situation is highly unlikely
(the same transaction that perform schema change had to have
modified sufficient number of rows in the table to fill
up all of the memory budget).

Fixes cockroachdb#108040

Release note (enterprise change): Fix a potential "deadlock" when
running changefeed with `end_time` option set.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Aug 2, 2023
The changefeed (or KV feed to be precise) may skip
some events when "scan boundary" is reached.
Scan boundary is a timestamp when certain event occurs --
usually a schema change.  But, it may also occur
when the `end_time` option is set.

The KV feed ignores events that have MVCC timestamp
greater or equal to the scan boundary event.

Unfortunately, due to a long outstanding bug, the memory
allocation associated with the event would not be released
when KV feed decides to skip the event.

Because of this, allocated memory was "leaked" and not reclaimed.
If enough additional events arrive, those leaked events may
account for all of the memory budget, thus leading to inability
for additional events to be added.

This bug impacts any changefeeds running with the `end_time`
option set.  It might also impact changefeeds that observe
normal schema change event, though this situation is highly unlikely
(the same transaction that perform schema change had to have
modified sufficient number of rows in the table to fill
up all of the memory budget).

Fixes cockroachdb#108040

Release note (enterprise change): Fix a potential "deadlock" when
running changefeed with `end_time` option set.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants