Node failures with "snapshot intersects existing range" on v2.0.0 #28278
snapshot intersects existing range on v2.0.4

@tschottdorf for triage
What's happening here looks like a split that took place in the presence of a lagging follower. The range that splits is r4955, and the lagging follower is n5; r4955 on the follower was lagging while the split happened. I'm not seeing any Raft snapshots for r4955, though, only for r4956 (the ones you posted, which fail because they can only apply once the LHS has caught up past the split and resized itself).

So the question is why it took so long for r4955 on n5 to catch up. We see that it needs two snapshots in relatively quick succession, and that the Raft log index rises by ~800 within 7 seconds, so there's some activity on that range. The primary key (and by the way, this is really bad and has been pointed out numerous times on the reg cluster) is an anonymized query, so the inserts might be pretty large. And in fact, looking at the logs, I see gems like these (and not just one, but pages and pages of them).

At the same time, the compactor is doing its thing, and the cluster is on v2.0.0, which contains lots of perf-critical bugs we've fixed over the last months. Please update the cluster, and ideally avoid these large primary keys (for example by truncating everything to, say, 200 chars before inserting it). After that, we can look at this cluster again; until then we're likely just rediscovering existing problems.
Thanks @tschottdorf. I bumped the version in my cluster builder script but mistakenly forgot to upload the new CloudFormation template to S3. I'll try with the actual 2.0.4 and follow up.
I can't reproduce this on 2.0.4. Closing.
I have a couple of nodes failing that are preventing me from running DistSQL queries due to #15637. Looking at the logs on the failed nodes, I see errors like
and
I've attached the debug zip: distsqlfail.zip