Revision decreasing after panic during compaction #17780
Created a PR with code to reproduce the issue on CI, as I don't have access to arm64 machines: #17782. Running multiple concurrent jobs:
Hmm, still no repro. Are we so unlucky that we hit a bit flip or a broken machine?
I was not able to repro it on my mac (arm64) with 500 runs and the failpoint.
Not able to download the log. The reproduce attempts on a linux amd64 machine were also not successful. We need a persisted audit log to easily replay the traffic. Turning on LogUnaryInterceptor could be a good start.
What do you mean? I was able to download main-arm64.zip from https://github.com/etcd-io/etcd/actions/runs/8659974818 without any problem.
I think we need to assume that this was a hardware issue. One last thing to confirm: could someone check the bbolt file from the report? It would be a good sanity check that the revision decrease really happened.
@serathius I was able to download and have shared it with Chao (took a few retries, spotty networking).
@serathius Just to confirm, is this what you were looking for? I did not observe the revision number decreasing, but there are some gaps between revisions.
Spoke too soon: multiple laptops, different browsers, same issue (zip corrupted).
Hmm, the last 4 operations before the crash were deletes. Defrag removes revisions in the Key bucket for keys that were deleted. Etcd infers the last revision based on the Key bucket. After restart, etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr
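For anyone who wants to do the same sanity check on the bbolt file, here is a minimal sketch (not part of etcd's tooling) that dumps the newest revision recorded in the key bucket. It assumes etcd's revision key encoding of an 8-byte big-endian main revision, a '_' separator, and an 8-byte sub revision, with an optional trailing 't' tombstone marker; verify against the mvcc package before trusting the output.

```go
// lastrev: print the newest revision stored in the "key" bucket of an etcd
// bbolt file. The key layout (8-byte main rev, '_', 8-byte sub rev, optional
// trailing 't' for tombstones) is an assumption based on etcd's mvcc encoding.
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: lastrev <path-to-db-file>")
	}
	db, err := bolt.Open(os.Args[1], 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf("no \"key\" bucket found")
		}
		k, _ := b.Cursor().Last() // keys sort by revision, so Last() is the newest
		if len(k) < 17 {
			return fmt.Errorf("unexpected key length %d", len(k))
		}
		mainRev := binary.BigEndian.Uint64(k[:8])
		subRev := binary.BigEndian.Uint64(k[9:17])
		tombstone := len(k) > 17 && k[len(k)-1] == 't'
		fmt.Printf("newest revision in key bucket: main=%d sub=%d tombstone=%v\n",
			mainRev, subRev, tombstone)
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```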
Will try to increase the number of deletes to reproduce a similar case. Looking at the results from a couple of runs, deletes are pretty rare in Kubernetes traffic; possibly the request type picker is broken.
Trying to interpret the statement: the difference between the db file key bucket layout and the recorded response is due to compact & defrag after resuming from the panic, right? But it does not make sense.
The workflow doesn't have any problem. Compact + defragmentation won't remove the last revision, even if it's a tombstone revision.
The bbolt db file should be correct. It looks like the report isn't consistent with regard to revision 298. So is it a bug in the test itself?
With #17815 this issue has been confirmed and reproduced on all supported release branches.
Root cause

Based on my discussion with @fuweid today, here is a summary of the root cause of this issue for anyone's reference.

Versions affected

All versions (3.4.x, 3.5.x, main) have this issue. For a single-node cluster, the symptom is that the revision decreases. For a multi-node cluster, the symptom is not only that the revision decreases, but also that revisions become inconsistent across the etcd cluster. Note that the key/value data is still consistent when this issue is reproduced.

Hard to reproduce

The good news is that this issue should be very hard to reproduce in a production environment, because it can only be reproduced when all the following conditions are true:

Solution

One proposed solution: #17815 (comment). Another solution is updating currentRev using the scheduledCompactRevision on bootstrap. See

Workaround

Once it's reproduced, we can use bump-revision to manually bump the revision so that all etcd instances have a consistent revision.
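For illustration only, here is a rough sketch of what the second proposed solution (never letting currentRev fall below the scheduled compact revision on bootstrap) could look like. The bucket and key names ("meta", "scheduledCompactRev"), the helper name restoreCurrentRev, and the revision encoding are assumptions made for this sketch based on the discussion above, not a copy of the actual patch in #17815.

```go
package sketch

import (
	"encoding/binary"

	bolt "go.etcd.io/bbolt"
)

// restoreCurrentRev sketches the idea behind the second proposed fix: if a
// crash happened after compaction started but before finishedCompactRev was
// persisted, the scheduled compact revision recorded in the meta bucket is a
// lower bound for the store's current revision, so the restored currentRev
// must never fall below it.
func restoreCurrentRev(tx *bolt.Tx, currentRev int64) int64 {
	meta := tx.Bucket([]byte("meta")) // assumed bucket name
	if meta == nil {
		return currentRev
	}
	v := meta.Get([]byte("scheduledCompactRev")) // assumed key name
	if len(v) < 8 {
		return currentRev
	}
	// Assumed encoding: the first 8 bytes hold the big-endian main revision.
	scheduledCompactRev := int64(binary.BigEndian.Uint64(v[:8]))
	if scheduledCompactRev > currentRev {
		return scheduledCompactRev
	}
	return currentRev
}
```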
With the issue confirmed, we need to do an impact assessment: check which versions it affects, how often it can happen, whether it can cause inconsistency in a multi-member cluster, and whether it can happen in Kubernetes.
Can you provide some context on why it should be very hard to reproduce, so that everyone can follow? Is it just based on the small window of crash vulnerability due to the infrequency of compaction? How does that probability look for Kubernetes? The fact that the last revision needs to be a tombstone reduces the chances, but it would be good to confirm what percentage of requests deletes constitute.
Sorry for the confusion. I should have been clearer. Just updated my previous comment.
Thanks, looks great!
This makes sense; compacting the last revision is not expected behavior for most users. Kubernetes' built-in compaction should almost never do that. The "almost" comes from cases where there were no writes at all for 5 minutes, which is unexpected given that Kubernetes Node Lease and Leader election write periodically.
Hi @ahrtr, there are 3 pull requests; please take a look. Thanks.
@fuweid Thanks. Please also update the changelog for 3.4 and 3.5.
Keeping it open until we cover it with a robustness test.
/assign
Let's close it for now.
Bug report criteria
What happened?
Failure in https://github.com/etcd-io/etcd/actions/runs/8659974818
What did you expect to happen?
Revision doesn't decrease
How can we reproduce it (as minimally and precisely as possible)?
Follow https://github.com/etcd-io/etcd/tree/main/tests/robustness#re-evaluate-existing-report to validate report from https://github.com/etcd-io/etcd/actions/runs/8659974818
Run TestRobustnessExploratory/Kubernetes/LowTraffic/ClusterOfSize1 with the failpoint compactBeforeSetFinishedCompact=panic(); a sketch of arming the failpoint via gofail's HTTP endpoint follows below.
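For reference, here is a small sketch of arming that failpoint from Go over gofail's HTTP endpoint. It assumes the etcd binary was built with failpoints enabled (e.g. via the gofail-enable make target) and started with GOFAIL_HTTP pointing at the address used below; the address itself is an example, not a default.

```go
// Arm the compactBeforeSetFinishedCompact failpoint with the term "panic()"
// via gofail's HTTP endpoint. The address must match the GOFAIL_HTTP value
// the etcd process was started with (an assumption here, not a default).
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	const endpoint = "http://127.0.0.1:22381/compactBeforeSetFinishedCompact"

	req, err := http.NewRequest(http.MethodPut, endpoint, strings.NewReader("panic()"))
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("failpoint armed:", resp.Status)
}
```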
Anything else we need to know?
TODO:
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response