ccl/changefeedccl: TestChangefeedNemeses failed #83882
Comments
The changefeed saw these two events out of order:
A touch update on row 3
The insert of row 3
The |
Here's the relevant test log, it's in a tmp folder on teamcity so I assume it goes poof at some point. The sequence of events looks something like: ADD COLUMN |
Probably pausing my investigation here for now because if the spurious update is the only way this can happen, and the only real event it can be out of order with respect to is the insert, that's going away soon and it's not likely to break anything if those two messages appear out of order. Very bad if we got a spurious update incorrectly ordered with respect to a real update, but so far no evidence that that's possible. |
I don't think we should dismiss this whatsoever. How reproducible is this? |
Have not yet reproduced it--it didn't happen in the first 200 runs locally. |
200 is not very many. I'd spin up at least 100 cores and |
Every time this test has failed with this sort of failure it has been a real bug. |
...yup, just got this out of order insert and delete:
In both cases so far this was the cloudstorage sink and the job had recently restarted, so this could be an issue with how the test consumer is doing ordering across different process ids. I'm going to see if I can get it to happen with a different sink. |
That's fine to rule out the other sinks having a bug, but the cloudstorage sink is particularly subtle in its ordering guarantees and that test was very valuable in validating fixes to that ordering. It might be worth bisecting if you can get a repro reliably in under 5 minutes at some scale. |
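For readers following along, here is a minimal sketch of the kind of per-key ordering check being discussed. It is not the validator the nemeses test actually uses; `orderChecker`, `noteRow`, and `noteResolved` are hypothetical names, and timestamps are plain integers rather than HLC timestamps. It assumes at-least-once delivery, so an exact replay of an already-seen (key, timestamp) pair is tolerated, while a previously unseen event that arrives below the key's high-water mark, or at or below the last resolved timestamp, is flagged.

```go
package main

import "fmt"

// orderChecker is a hypothetical, simplified ordering validator.
type orderChecker struct {
	seen     map[string]map[int64]bool // key -> timestamps already observed
	maxSeen  map[string]int64          // key -> highest timestamp observed
	resolved int64                     // highest resolved timestamp observed
	failures []string
}

func newOrderChecker() *orderChecker {
	return &orderChecker{
		seen:    map[string]map[int64]bool{},
		maxSeen: map[string]int64{},
	}
}

// noteRow records one row event and appends a failure if it violates ordering.
func (c *orderChecker) noteRow(key string, ts int64) {
	if c.seen[key] == nil {
		c.seen[key] = map[int64]bool{}
	}
	if c.seen[key][ts] {
		return // duplicate delivery is allowed under at-least-once semantics
	}
	if ts < c.maxSeen[key] {
		c.failures = append(c.failures,
			fmt.Sprintf("key %s: new event at %d after event at %d", key, ts, c.maxSeen[key]))
	}
	if ts <= c.resolved {
		c.failures = append(c.failures,
			fmt.Sprintf("key %s: new event at %d at or below resolved %d", key, ts, c.resolved))
	}
	c.seen[key][ts] = true
	if ts > c.maxSeen[key] {
		c.maxSeen[key] = ts
	}
}

// noteResolved records a resolved timestamp.
func (c *orderChecker) noteResolved(ts int64) {
	if ts > c.resolved {
		c.resolved = ts
	}
}

func main() {
	c := newOrderChecker()
	c.noteRow("3", 5) // touch update on row 3 arrives first
	c.noteRow("3", 2) // the insert of row 3 arrives afterwards
	fmt.Println(c.failures)
}
```

With the two events from the original report fed in that order, the checker records a single out-of-order failure for row 3.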
I suspect that #83530 might be a potential culprit. Since the nemeses test pauses and resumes jobs quite often, that change made it possible for the checkpoint to be skipped while the sink may still receive a resolved timestamp flush; that's... probably not good. This also explains the very large uptick in changefeed flakes, particularly around cloudstorage sinks, because cloudstorage |
Address multiple sources of flakes in changefeed tests.

cockroachdb#83530 made a change to ensure that changefeeds do not fail when they are in a transient (e.g. pause-requested) state. Unfortunately, the PR made a mistake: even if the checkpoint could not be completed because the changefeed was in the "pause requested" state, we would still proceed to emit a resolved event. This is wrong, and the resolved event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to an existing changefeed, with an initial scan. In such cases, the newly added table will emit events as of the timestamp of the "alter changefeed" statement. When this happens, the semantics around resolved events are murky, as documented in cockroachdb#84102. Address this issue by making the cloud storage sink more permissive in its handling of resolved timestamps.

When completing the initial scan for newly added tables, fix an "off by 1" error where the frontier was advanced to the next timestamp. This was wrong since cockroachdb#82451 clarified that the rangefeed start time is exclusive.

Informs cockroachdb#83882
Fixes cockroachdb#83946
Release Notes: None
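The first fix described above boils down to "emit resolved only after a successful checkpoint". A minimal sketch of that rule follows, under stated assumptions: `frontier`, `checkpoint`, `maybeEmitResolved`, and `errPauseRequested` are hypothetical names invented for illustration, not the changefeed code itself, and timestamps are plain integers.

```go
package main

import (
	"errors"
	"fmt"
)

// errPauseRequested is a hypothetical sentinel for a checkpoint that was
// skipped because the job is in a transient state.
var errPauseRequested = errors.New("job is pause-requested; checkpoint skipped")

// frontier is a hypothetical stand-in for the changefeed's progress tracker.
type frontier struct {
	paused   bool    // true while the job is in the pause-requested state
	saved    int64   // highest timestamp durably checkpointed
	resolved []int64 // resolved timestamps emitted to the sink
}

// checkpoint persists ts, or reports that it could not.
func (f *frontier) checkpoint(ts int64) error {
	if f.paused {
		return errPauseRequested
	}
	f.saved = ts
	return nil
}

// maybeEmitResolved emits a resolved timestamp only after the checkpoint
// has succeeded. A skipped checkpoint does not fail the feed, but it also
// must not produce a resolved event.
func (f *frontier) maybeEmitResolved(ts int64) error {
	if err := f.checkpoint(ts); err != nil {
		if errors.Is(err, errPauseRequested) {
			return nil // transient state: tolerate, but emit nothing
		}
		return err
	}
	f.resolved = append(f.resolved, ts)
	return nil
}

func main() {
	f := &frontier{paused: true}
	_ = f.maybeEmitResolved(10)
	fmt.Println(len(f.resolved)) // nothing emitted while pause-requested

	f.paused = false
	_ = f.maybeEmitResolved(20)
	fmt.Println(f.resolved, f.saved)
}
```

The point of the sketch is the early return: tolerating the transient error and emitting the resolved event are decided separately, so the first cannot silently imply the second.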
84109: changefeedcc: De-flake changefeed tests. r=miretskiy a=miretskiy

Address multiple sources of flakes in changefeed tests.

#83530 made a change to ensure that changefeeds do not fail when they are in a transient (e.g. pause-requested) state. Unfortunately, the PR made a mistake: even if the checkpoint could not be completed because the changefeed was in the "pause requested" state, we would still proceed to emit a resolved event. This is wrong, and the resolved event should never be emitted if we failed to checkpoint.

In addition, alter changefeed can be used to add new tables to an existing changefeed, with an initial scan. In such cases, the newly added table will emit events as of the timestamp of the "alter changefeed" statement. When this happens, the semantics around resolved events are murky, as documented in #84102. Address this issue by making the cloud storage sink more permissive in its handling of resolved timestamps.

When completing the initial scan for newly added tables, fix an "off by 1" error where the frontier was advanced to the next timestamp. This was wrong since #82451 clarified that the rangefeed start time is exclusive.

Informs #83882
Fixes #83946
Release Notes: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
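To illustrate the "off by 1" point, here is a small sketch of what an exclusive start time means. This is one reading of the description above, not the actual rangefeed code: `eventsAfter` is a hypothetical helper and timestamps are plain integers. If the initial scan already covers everything up to and including the scan timestamp, the subscription should start at that same timestamp; advancing to the next timestamp first would skip a write that lands exactly there.

```go
package main

import "fmt"

// eventsAfter returns the timestamps strictly greater than start,
// modeling a subscription whose start time is exclusive.
func eventsAfter(all []int64, start int64) []int64 {
	var out []int64
	for _, ts := range all {
		if ts > start {
			out = append(out, ts)
		}
	}
	return out
}

func main() {
	scanTS := int64(100) // the initial scan covers writes at or below this timestamp
	writes := []int64{100, 101, 102}

	// Starting at scanTS picks up every write the scan did not cover.
	fmt.Println(eventsAfter(writes, scanTS))

	// Starting at scanTS+1 drops the write at 101.
	fmt.Println(eventsAfter(writes, scanTS+1))
}
```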
@miretskiy we're (flaky-test-fighter-team) seeing TestChangefeedNemesis timeout on 22.1 - https://teamcity.cockroachdb.com/viewLog.html?buildId=5770121&buildTypeId=Cockroach_UnitTests_BazelUnitTests&tab=buildResultsDiv. Do you think your linked PR should/could be backported? |
Probably; re-opened #80475 to keep track of this.
|
Thank you! |
ccl/changefeedccl.TestChangefeedNemeses failed with artifacts on master @ 773f7d4445ce3e0e806b7a182adba70a0f270f19:
Parameters: |
This is a new failure. === RUN TestChangefeedNemeses/pubsub goroutine 108099465 [running]: |
See cockroachdb#83882 (comment) There are multiple ways to get a nil return value here, so it's unclear what the actual error was here--with any luck further nemesis runs will turn it up. Release note: None
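As a generic illustration of this class of panic (not the actual hydration code): in Go, an unchecked type assertion on a nil interface value panics, which is easy to hit when a function returns `nil` alongside an error and the caller asserts before checking it. The sketch below uses hypothetical names (`descriptor`, `tableDesc`, `hydrate`, `lookupTable`) and shows the defensive form, checking the error first and using the comma-ok assertion.

```go
package main

import (
	"errors"
	"fmt"
)

// descriptor and tableDesc are hypothetical stand-ins for illustration.
type descriptor interface{ Name() string }

type tableDesc struct{ name string }

func (t *tableDesc) Name() string { return t.name }

// hydrate returns a nil descriptor together with an error when it fails.
func hydrate(fail bool) (descriptor, error) {
	if fail {
		return nil, errors.New("hydration failed")
	}
	return &tableDesc{name: "t"}, nil
}

// lookupTable checks the error before asserting, and uses the comma-ok
// form so an unexpected or nil value becomes an error instead of a panic.
// Writing d.(*tableDesc) without these checks would panic when d is nil.
func lookupTable(fail bool) (*tableDesc, error) {
	d, err := hydrate(fail)
	if err != nil {
		return nil, err
	}
	tbl, ok := d.(*tableDesc)
	if !ok {
		return nil, fmt.Errorf("unexpected descriptor type %T", d)
	}
	return tbl, nil
}

func main() {
	if _, err := lookupTable(true); err != nil {
		fmt.Println("error:", err)
	}
	tbl, _ := lookupTable(false)
	fmt.Println(tbl.Name())
}
```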
#86062 is a partial fix but it's likely the underlying bug will resurface in another form so keeping this open for now. |
86051: build: publish cockroach-sql as an archive r=rickystewart a=rail Previously, cockroach-sql was published as a standalone binary on S3. This made the UX cumbersome for end-users: * we provide explanations in docs about how to extract archives, and this is not an archive * on macOS and Linux, the user still needs to run chmod +x on the result, and we didn't document that This PR packages the `cockroach-sql` binary as a tarball/zip file and generates its SHA256 checksum. Fixes #81246 Release note: None 86062: sql: fix interface conversion panic when hydrating returns an error r=[ajwerner] a=HonoreDB See #83882 (comment) There are multiple ways to get a nil return value here, so it's unclear what the actual error was here--with any luck further nemesis runs will turn it up. Checked the immediate call sites and they all look like they can handle a nil. Release note: None Co-authored-by: Rail Aliiev <[email protected]> Co-authored-by: Aaron Zinger <[email protected]>
ccl/changefeedccl.TestChangefeedNemeses failed with artifacts on master @ 003c0360de8b64319b5f0f127b99be91dbdca8a3:
Parameters: |
Fixed by #86794 |
ccl/changefeedccl.TestChangefeedNemeses failed with artifacts on master @ 33d70998719051ee058bc9e516afa238ea7b7451:
Parameters:
TAGS=bazel,gss
Jira issue: CRDB-17328
Epic CRDB-11732