roachtest: c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed [stalled rangefeed] #119333
Node 3 got stuck, and the replanner did not kick in because it made initial progress:
This happened again an hour later on node 4.
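For context on why the replanner stayed quiet, here is a minimal sketch of a progress-gated replan check, assuming the replanner only fires for nodes that have made no progress at all since the stream started; `nodeProgress` and `shouldReplan` are illustrative names, not the actual c2c replanner code.

```go
// Illustrative only: a replan check gated on "has this node made any progress?"
// A node that advanced its frontier once and then stalled, like node 3 here,
// never trips such a check.
package main

import (
	"fmt"
	"time"
)

type nodeProgress struct {
	initialFrontier time.Time // frontier when the stream was planned
	currentFrontier time.Time // latest resolved timestamp seen since then
}

// shouldReplan fires only for nodes that never advanced past their initial
// frontier, so "made some progress, then got stuck" is invisible to it.
func shouldReplan(p nodeProgress) bool {
	return !p.currentFrontier.After(p.initialFrontier)
}

func main() {
	start := time.Now().Add(-60 * time.Minute)
	stuck := nodeProgress{
		initialFrontier: start,
		currentFrontier: start.Add(43 * time.Minute), // advanced, then stalled
	}
	fmt.Println(shouldReplan(stuck)) // false: no replan despite the stall
}
```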
Looks like a rangefeed on source node 3 stalled for 17 minutes with this fun new error, which was continuously logged from 12:23 to 12:40 and lines up with the 17-minute delay reported by the lag checker:
Looks like this error was recently introduced by #118413 by @erikgrinaker. Since this looks like a rangefeed bug, I imagine it could cause a changefeed to stall as well. Adding a release blocker.
Hrm. While I'm pretty sure this range barrier error caused the c2c stream to stall, given the timing, one wrinkle I should note is that the c2c logging indicated that destination node 4 was lagging behind. But according to
I suspect the rangefeed in the KV server on node 3 was then passed to destination node 4 via source node 2, but I'm not totally sure.
I think this indicates that the range was split, but the split wasn't applied to the follower. I'll have a closer look when I have a chance.
This happened after r231 merged into r229:
The barrier is sent with the correct, post-merge bounds, but the server sees the old, pre-merge bounds and rejects the request. Why?
The barrier request doesn't bypass the lease check or anything, so it should be evaluated on the leaseholder, which should have applied the merge already (lines 1776 to 1782 at e494351).
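To illustrate the failure mode, here is a rough sketch of a server-side span-bounds check, assuming the replica rejects any request whose span is not contained within the range bounds it currently knows; `span`, `checkBounds`, and the error text are simplified stand-ins, not the CockroachDB code linked above.

```go
// Simplified illustration of a replica rejecting a request whose span does not
// fit within the range bounds it knows about. If the replica still had the
// pre-merge descriptor for r229, a barrier sent with the post-merge bounds
// would fail this check even though the client's bounds are correct.
package main

import (
	"bytes"
	"fmt"
)

type span struct{ key, endKey []byte }

// contains reports whether the outer span fully covers the inner span.
func contains(outer, inner span) bool {
	return bytes.Compare(outer.key, inner.key) <= 0 &&
		bytes.Compare(inner.endKey, outer.endKey) <= 0
}

func checkBounds(replicaBounds, requestSpan span) error {
	if !contains(replicaBounds, requestSpan) {
		return fmt.Errorf("key range %q-%q outside of bounds of range %q-%q",
			requestSpan.key, requestSpan.endKey, replicaBounds.key, replicaBounds.endKey)
	}
	return nil
}

func main() {
	preMerge := span{[]byte("a"), []byte("m")}  // stale, pre-merge bounds
	barrier := span{[]byte("a"), []byte("z")}   // post-merge bounds sent by the client
	fmt.Println(checkBounds(preMerge, barrier)) // rejected, even though the client is right
}
```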
Could this be a tenant thing or something, where the bounds aren't propagated to the tenant yet? 🤔 Wouldn't immediately think so, since this runs in KV. Haven't seen this happen in the wild, just checked CC. |
All three nodes applied the merge immediately:
Secondary question: why do we keep trying to push this transaction? Are we not seeing the intent resolution replicated, which would remove it from tracking? Maybe because we don't emit any of the events here when the barrier fails or something. Will look into it after I fix the primary problem.
I have no fewer than three prototype fixes; we'll need to gauge backport safety.
I think that's exactly the problem: this was probably an aborted transaction whose txn record was GCed, but there is nothing actually causing the intents to be GCed. We would do so after sending the barrier, but we error out before that. I'll submit a PR to attempt to resolve intents even if the barrier fails, which should be a bit more robust.
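A minimal sketch of that control-flow change, under the assumption that the fix simply attempts intent resolution regardless of the barrier outcome and reports the barrier error afterwards; `sendBarrier`, `resolveIntents`, and both cleanup functions are placeholders, not the actual PR.

```go
// Hypothetical before/after: the original flow returns early when the barrier
// fails, so abandoned intents are never cleaned up; the proposed flow attempts
// intent resolution even on barrier failure, only then surfacing the error.
package main

import "errors"

type intent struct{ key string }

func sendBarrier() error              { return errors.New("barrier rejected") } // stand-in
func resolveIntents(_ []intent) error { return nil }                            // stand-in

// cleanupBefore mirrors the current behavior: erroring out before intent
// resolution, leaving the intents (and the repeated txn pushes) behind.
func cleanupBefore(intents []intent) error {
	if err := sendBarrier(); err != nil {
		return err
	}
	return resolveIntents(intents)
}

// cleanupAfter mirrors the proposed fix: always attempt intent resolution,
// and report the barrier error afterwards if resolution itself succeeded.
func cleanupAfter(intents []intent) error {
	barrierErr := sendBarrier()
	if err := resolveIntents(intents); err != nil {
		return err
	}
	return barrierErr
}

func main() {
	intents := []intent{{key: "k1"}}
	_ = cleanupBefore(intents) // barrier error returned, intents never resolved
	_ = cleanupAfter(intents)  // intents resolved, barrier error still reported
}
```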
And another candidate fix:
After discussing, we're going to go with #119512. We just have to confirm that it will reliably update the range cache first, and add another test.
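For illustration only, here is a rough sketch of the "refresh the range cache and retry" idea, not necessarily what #119512 actually does; all of the types and functions here (`rangeCache`, `barrierWithRetry`, `sendBarrier`) are hypothetical stand-ins.

```go
// Hypothetical retry loop: on a bounds-mismatch error, refresh the cached range
// descriptor and retry the barrier with fresh, post-merge bounds.
package main

import (
	"errors"
	"fmt"
)

type rangeDesc struct{ startKey, endKey string }

var errBoundsMismatch = errors.New("request outside of range bounds")

// rangeCache is a stand-in for a client-side descriptor cache.
type rangeCache struct{ desc rangeDesc }

func (c *rangeCache) lookup() rangeDesc { return c.desc }
func (c *rangeCache) refresh()          { c.desc = rangeDesc{"a", "z"} } // pretend meta lookup

func sendBarrier(d rangeDesc) error {
	if d.endKey != "z" { // server only accepts the post-merge bounds
		return errBoundsMismatch
	}
	return nil
}

func barrierWithRetry(c *rangeCache, maxAttempts int) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = sendBarrier(c.lookup()); !errors.Is(err, errBoundsMismatch) {
			return err
		}
		c.refresh() // stale descriptor: update the cache before retrying
	}
	return err
}

func main() {
	c := &rangeCache{desc: rangeDesc{"a", "m"}} // stale, pre-merge bounds
	fmt.Println(barrierWithRetry(c, 3))         // <nil> once the cache is refreshed
}
```

The property to confirm is exactly the one mentioned above: that the mismatch error reliably triggers the cache update, so the retry actually carries fresh bounds.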
roachtest.c2c/tpcc/warehouses=1000/duration=60/cutover=30 failed with artifacts on release-23.2.1-rc @ 898cd6a363fd47bb92a03bac216f9bed0f64bc08:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=8
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=false
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Same failure on other branches
This test on roachdash | Improve this report!
Jira issue: CRDB-36155