Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/changefeedccl: TestChangefeedNemeses failed #83882

Closed
cockroach-teamcity opened this issue Jul 6, 2022 · 18 comments
Closed

ccl/changefeedccl: TestChangefeedNemeses failed #83882

cockroach-teamcity opened this issue Jul 6, 2022 · 18 comments
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jul 6, 2022

ccl/changefeedccl.TestChangefeedNemeses failed with artifacts on master @ 33d70998719051ee058bc9e516afa238ea7b7451:

=== RUN   TestChangefeedNemeses
    test_log_scope.go:79: test logs captured to: /artifacts/tmp/_tmp/a77002d7c9453d7cd2d382f907780e13/logTestChangefeedNemeses755396176
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestChangefeedNemeses
    nemeses_test.go:59: -- test log scope end --
--- FAIL: TestChangefeedNemeses (4.76s)
=== RUN   TestChangefeedNemeses/cloudstorage
    helpers_test.go:678: making server as system tenant
    helpers_test.go:765: making cloudstorage feed factory
    nemeses_test.go:40: topic foo partition : saw new row timestamp 1657085602994720146.0000000000 after 1657085603125890405.0000000000 was seen
    nemeses_test.go:40: fingerprints did not match at 1657085602915039067.0000000000: EMPTY vs -4371770392475824696
    nemeses_test.go:40: fingerprints did not match at 1657085602936527623.0000000000: EMPTY vs -4371770392475824696
    nemeses_test.go:40: fingerprints did not match at 1657085602994720145.2147483647: EMPTY vs -4371770392475824696
    nemeses_test.go:40: fingerprints did not match at 1657085602994720146.0000000000: -8696398075016565150 vs 4901158821878539178
    nemeses_test.go:40: fingerprints did not match at 1657085603018493714.0000000000: -8696398075016565150 vs 4901158821878539178
    nemeses_test.go:40: fingerprints did not match at 1657085603125890404.2147483647: -8696398075016565150 vs 4901158821878539178
    nemeses_test.go:40: fingerprints did not match at 1657085603125890405.0000000000: -8696398075016565150 vs 4901158821878539178
    nemeses_test.go:40: fingerprints did not match at 1657085603137258302.2147483647: -8696398075016565150 vs 4901158821878539178
    nemeses_test.go:40: fingerprints did not match at 1657085603137258302.2147483647: -8696398075016565150 vs 4901158821878539178
    nemeses_test.go:40: fingerprints did not match at 1657085603137258303.0000000000: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085603538971071.0000000000: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085603739546738.0000000000: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085603939901732.0000000000: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085604056173068.2147483647: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085604056173069.0000000000: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085604061878208.2147483647: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085604061878208.2147483647: 3202056503709913342 vs -1208191130819928778
    nemeses_test.go:40: fingerprints did not match at 1657085604061878209.0000000000: -2441903359552905246 vs 2110152103965790762
    nemeses_test.go:40: fingerprints did not match at 1657085604341564517.0000000000: -2441903359552905246 vs 2110152103965790762
    --- FAIL: TestChangefeedNemeses/cloudstorage (4.68s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/cdc

This test on roachdash | Improve this report!

Jira issue: CRDB-17328

Epic CRDB-11732

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jul 6, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 6, 2022
@blathers-crl blathers-crl bot added the T-cdc label Jul 6, 2022
@HonoreDB HonoreDB self-assigned this Jul 6, 2022
@HonoreDB
Copy link
Contributor

HonoreDB commented Jul 6, 2022

The changefeed saw these two events out of order:

A touch update on row 3

I220706 05:33:23.198145 728268 ccl/changefeedccl/cdctest/nemeses.go:752 ⋮ [-] 172  ‹[3]›->‹{"after": {"id": 3, "ts": "1657085602994720146.0000000000"}, "before": {"id": 3, "ts": "1657085602994720146.0000000000"}, "updated": "1657085603125890405.0000000000"}

The insert of row 3

I220706 05:33:23.988561 728268 ccl/changefeedccl/cdctest/nemeses.go:752 ⋮ [-] 180  ‹[3]›->‹{"after": {"id": 3, "ts": "1657085602994720146.0000000000"}, "before": null, "updated": "1657085602994720146.0000000000"}›

The updated timestamp for the insert is earlier than the updated timestamp for the touch-update: 1657085602994720146.0000000000 < 1657085603125890405.0000000000. So the updated timestamps are consistent, it's the order in which they appeared in the feed which is wrong.

@HonoreDB
Copy link
Contributor

HonoreDB commented Jul 6, 2022

Here's the relevant test log, it's in a tmp folder on teamcity so I assume it goes poof at some point. The sequence of events looks something like:

ADD COLUMN
PAUSE CHANGEFEED
RESUME CHANGEFEED
OPEN TRANSACTION, UPSERT ROW 3
jobs says it's resuming the changefeed now
COMMIT
(lots of other activity)
backfill from the add column finishes, schema change marked as complete
changefeed sees a spurious update on row 3 as a result of the backfill due to this issue
RANGE SPLIT
Changefeed sees the insert for row 3

@HonoreDB
Copy link
Contributor

HonoreDB commented Jul 6, 2022

Probably pausing my investigation here for now because if the spurious update is the only way this can happen, and the only real event it can be out of order with respect to is the insert, that's going away soon and it's not likely to break anything if those two messages appear out of order. Very bad if we got a spurious update incorrectly ordered with respect to a real update, but so far no evidence that that's possible.

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

I don't think we should dismiss this whatsoever. How reproducible is this?

@HonoreDB
Copy link
Contributor

HonoreDB commented Jul 6, 2022

Have not yet reproduced it--it didn't happen in the first 200 runs locally.

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

200 is not very many. I'd spin up at least 100 cores and roachprod-stress it

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

Every time this test has failed with this sort of failure it has been a real bug.

@HonoreDB
Copy link
Contributor

HonoreDB commented Jul 6, 2022

...yup, just got this out of order insert and delete:

I220706 20:04:23.177984 100 ccl/changefeedccl/cdctest/nemeses.go:752 ⋮ [-] 259  ‹[2]›->‹{"after": null, "before": {"id": 2, "ts": "1657137862868940000.0000000000"}, "updated": "1657137862883989000.0000000000"}›
I220706 20:04:23.367020 100 ccl/changefeedccl/cdctest/nemeses.go:752 ⋮ [-] 291  ‹[2]›->‹{"after": {"id": 2, "ts": "1657137862868940000.0000000000"}, "before": null, "updated": "1657137862868940000.0000000000"}›

In both cases so far this was the cloudstorage sink and the job had recently restarted, so this could be an issue with how the test consumer is doing ordering across different process ids. I'm going to see if I can get it to happen with a different sink.

@ajwerner
Copy link
Contributor

ajwerner commented Jul 6, 2022

That's fine to rule out the other sinks having a bug, but the cloudstorage sink is particularly subtle in its ordering guarantees and that test was very valuable in validating fixes to that ordering. It might be worth bisecting if you can get a repro reliably in under 5 minutes at some scale.

@miretskiy
Copy link
Contributor

I suspect that #83530 might be a potential culprit;

Since nemeses test pauses/resumes jobs quite often, the above update made it so that checkpoint could be skipped, while the sink may still receive resolved timestamp flush; thats.... probably not good.

This also explains very large uptick in changefeed flakes, particularly around cloukdstorage sinks because cloudstorage
re-reads entire directory on every flush and if we emit a resolved timestamp file that we should not have emitted, then
any "previous" event files are ignored.

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jul 8, 2022
Address multiple source of flakes in changefeed tests.

cockroachdb#83530 made a change
to ensure that changefeed do not fail when they are in the transient
(e.g. pause-requested) state.  Unfortunately, the PR made a mistake
where even if the checkpoint could not be completed because
the cangefeed is in the "pause requested" state, we would still
proceed to emit resolved event.  This is wrong, and the resolved
event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to existing
changefeed, with initial scan.  In such cases, the newly added table
will emit events as of the timestamp of "alter changefeed statement".
When this happens, the semantics around resolved events are murky
as document in cockroachdb#84102
Address this issue by making cloud storage sink more permissive around
it's handling of resolved timestamp.

When completing initial scan for newly added tables, fix an "off by 1"
error when frontier was advanced to the next timestamp.
This was wrong since cockroachdb#82451
clarified that the rangefeed start time is exclusive.

Informs cockroachdb#83882
Fixes cockroachdb#83946

Release Notes: None
craig bot pushed a commit that referenced this issue Jul 9, 2022
84109: changefeedcc: De-flake changefeed tests. r=miretskiy a=miretskiy

Address multiple source of flakes in changefeed tests.

#83530 made a change
to ensure that changefeed do not fail when they are in the transient
(e.g. pause-requested) state.  Unfortunately, the PR made a mistake
where even if the checkpoint could not be completed because
the cangefeed is in the "pause requested" state, we would still
proceed to emit resolved event.  This is wrong, and the resolved
event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to existing
changefeed, with initial scan.  In such cases, the newly added table
will emit events as of the timestamp of "alter changefeed statement".
When this happens, the semantics around resolved events are murky
as documented in #84102
Address this issue by making cloud storage sink more permissive around
its handling of resolved timestamp.

When completing initial scan for newly added tables, fix an "off by 1"
error when frontier was advanced to the next timestamp.
This was wrong since #82451
clarified that the rangefeed start time is exclusive.

Informs #83882
Fixes #83946

Release Notes: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@adityamaru
Copy link
Contributor

adityamaru commented Jul 16, 2022

@miretskiy we're (flaky-test-fighter-team) seeing TestChangefeedNemesis timeout on 22.1 - https://teamcity.cockroachdb.com/viewLog.html?buildId=5770121&buildTypeId=Cockroach_UnitTests_BazelUnitTests&tab=buildResultsDiv. Do you think your linked PR should/could be backported?

@miretskiy
Copy link
Contributor

miretskiy commented Jul 16, 2022 via email

@adityamaru
Copy link
Contributor

Thank you!

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Jul 27, 2022
Address multiple source of flakes in changefeed tests.

cockroachdb#83530 made a change
to ensure that changefeed do not fail when they are in the transient
(e.g. pause-requested) state.  Unfortunately, the PR made a mistake
where even if the checkpoint could not be completed because
the cangefeed is in the "pause requested" state, we would still
proceed to emit resolved event.  This is wrong, and the resolved
event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to existing
changefeed, with initial scan.  In such cases, the newly added table
will emit events as of the timestamp of "alter changefeed statement".
When this happens, the semantics around resolved events are murky
as document in cockroachdb#84102
Address this issue by making cloud storage sink more permissive around
it's handling of resolved timestamp.

When completing initial scan for newly added tables, fix an "off by 1"
error when frontier was advanced to the next timestamp.
This was wrong since cockroachdb#82451
clarified that the rangefeed start time is exclusive.

Informs cockroachdb#83882
Fixes cockroachdb#83946

Release Notes: None

Release note (<category, see below>): <what> <show> <why>
@cockroach-teamcity
Copy link
Member Author

ccl/changefeedccl.TestChangefeedNemeses failed with artifacts on master @ 773f7d4445ce3e0e806b7a182adba70a0f270f19:

=== RUN   TestChangefeedNemeses/pubsub
    helpers_test.go:716: making server as system tenant
    helpers_test.go:803: making pubsub feed factory

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@HonoreDB HonoreDB added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 12, 2022
@HonoreDB
Copy link
Contributor

This is a new failure.

=== RUN TestChangefeedNemeses/pubsub
helpers_test.go:716: making server as system tenant
helpers_test.go:803: making pubsub feed factory
panic: interface conversion: interface is nil, not catalog.TableDescriptor

goroutine 108099465 [running]:
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).hydrateTypesInTableDesc(0xc00b5f2ea0, {0x7374bc8, 0xc0085eed50}, 0xc018fcda20, {0x74053f0, 0xc006ed0500})
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/hydrate.go:41 +0x169
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).getTableByID(0xc00b5f2ea0, {0x7374bc8, 0xc0085eed50}, 0xc018fcda20, 0x6a, {{0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/table.go:160 +0x385
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs.(*Collection).GetImmutableTableByID(0xc00b5f2ea0, {0x7374bc8, 0xc0085eed50}, 0xc018fcda20, 0x6a, {{0x0, 0x0, 0x0, 0x0, 0x0, ...}, ...})
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/table.go:137 +0xb8
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.refreshUDT.func1({0x7374bc8, 0xc0085eed50}, 0xc018fcda20)
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent/rowfetcher_cache.go:122 +0x11d
github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1({0x7374bc8, 0xc0085eed50}, 0xc018fcda20)
github.com/cockroachdb/cockroach/pkg/kv/db.go:963 +0x4b
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc018fcda20, {0x7374bc8, 0xc0085eed50}, 0xc0126e5990)
github.com/cockroachdb/cockroach/pkg/kv/txn.go:960 +0xd4
github.com/cockroachdb/cockroach/pkg/kv.runTxn({0x7374bc8, 0xc0085eed50}, 0xc018fcda20, 0xc0126e5ae0)
github.com/cockroachdb/cockroach/pkg/kv/db.go:962 +0xa5
github.com/cockroachdb/cockroach/pkg/kv.(*DB).TxnWithAdmissionControl(0xc006ee8120, {0x7374bc8, 0xc0085eed50}, 0x0, 0x0, 0xc0126e5ae0)
github.com/cockroachdb/cockroach/pkg/kv/db.go:925 +0xc5
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(0xc006ee8120, {0x7374bc8, 0xc0085eed50}, 0xc0126e5ae0)
github.com/cockroachdb/cockroach/pkg/kv/db.go:904 +0x4b
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.refreshUDT({0x7374bc8, 0xc0085eed50}, 0x6a, 0xc006ee8120, 0xc00b5f2ea0, {0x170a828f6fc3a17e, 0x0, 0x0})
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent/rowfetcher_cache.go:117 +0x151
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*rowFetcherCache).tableDescForKey(0xc0115a1340, {0x7374bc8, 0xc0085eed50}, {0xc010a34c08, 0x4, 0x78}, {0x170a828f6fc3a17e, 0x0, 0x0})
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent/rowfetcher_cache.go:168 +0x685
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).initForKey(0xc0115a1600, {0x7374bc8, 0xc0085eed50}, {0xc010a34c08, 0x4, 0x78}, {0x170a828f6fc3a17e, 0x0, 0x0})
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent/event.go:436 +0xde
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent.(*eventDecoder).DecodeKV(0xc0115a1600, {0x7374bc8, 0xc0085eed50}, {{0xc010a34c08, 0x4, 0x78}, {{0xc010a34c1a, 0x28, 0x66}, {0x170a828f5445fa53, ...}}}, ...)
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/cdcevent/event.go:403 +0xca
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent.func1(0xc00b5f3040, 0xc0126e69d0, {0x7374bc8, 0xc0085eed50}, {0x170a828f6fc3a17e, 0x0, 0x0})
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/event_processing.go:150 +0x30e
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*kvEventToRowConsumer).ConsumeEvent(, {, _}, {{{0xc010a34c08, 0x4, 0x78}, {{0xc010a34c1a, 0x28, 0x66}, {0x170a828f5445fa53, ...}}}, ...})
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/event_processing.go:151 +0x5e5
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeAggregator).tick(0xc001035800)
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:540 +0x35b
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeAggregator).Next(0xc001035800)
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:493 +0xcb
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl.(*changeFrontier).Next(0xc00d2b1500)
github.com/cockroachdb/cockroach/pkg/ccl/changefeedccl/changefeed_processors.go:1051

HonoreDB added a commit to HonoreDB/cockroach that referenced this issue Aug 12, 2022
See cockroachdb#83882 (comment)

There are multiple ways to get a nil return value here, so it's unclear
what the actual error was here--with any luck further nemesis runs will turn it
up.

Release note: None
@HonoreDB
Copy link
Contributor

#86062 is a partial fix but it's likely the underlying bug will resurface in another form so keeping this open for now.

craig bot pushed a commit that referenced this issue Aug 13, 2022
86051: build: publish cockroach-sql as an archive r=rickystewart a=rail

Previously, cockroach-sql was published as a standalone binary on S3.

This made the UX cumbersome for end-users:
* we provide explanations in docs about how to extract archives, and this is not an archive
* on macos and linux, the user needs to still run chmod +x on the result, and we didn't document that

This PR packages the `cockroach-sql` binary as a tarball/zip file and
generates its SH256 checksum.

Fixes #81246

Release note: None

86062: sql: fix interface conversion panic when hydrating returns an error r=[ajwerner] a=HonoreDB

See #83882 (comment)

There are multiple ways to get a nil return value here, so it's unclear
what the actual error was here--with any luck further nemesis runs will turn it
up. Checked the immediate call sites and they all look like they can handle a nil.

Release note: None

Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Aaron Zinger <[email protected]>
@cockroach-teamcity
Copy link
Member Author

ccl/changefeedccl.TestChangefeedNemeses failed with artifacts on master @ 003c0360de8b64319b5f0f127b99be91dbdca8a3:

=== RUN   TestChangefeedNemeses
    test_log_scope.go:162: test logs captured to: /artifacts/tmp/_tmp/a77002d7c9453d7cd2d382f907780e13/logTestChangefeedNemeses330743824
    test_log_scope.go:80: use -show-logs to present logs inline
=== CONT  TestChangefeedNemeses
    nemeses_test.go:59: -- test log scope end --
--- FAIL: TestChangefeedNemeses (2.97s)
=== RUN   TestChangefeedNemeses/sinkless
    helpers_test.go:716: making server as system tenant
    helpers_test.go:803: making sinkless feed factory
    nemeses_test.go:37: ERROR: context canceled (SQLSTATE XXUUU)
    --- FAIL: TestChangefeedNemeses/sinkless (2.85s)

Parameters: TAGS=bazel,gss

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@HonoreDB
Copy link
Contributor

Fixed by #86794

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-cdc
Projects
None yet
Development

No branches or pull requests

5 participants