Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revision decreasing after panic during compaction #17780

Closed
6 tasks done
serathius opened this issue Apr 12, 2024 · 31 comments · Fixed by #18099
Closed
6 tasks done

Revision decreasing after panic during compaction #17780

serathius opened this issue Apr 12, 2024 · 31 comments · Fixed by #18099
Assignees
Labels

Comments

@serathius
Copy link
Member

serathius commented Apr 12, 2024

Bug report criteria

What happened?

Failure in https://github.com/etcd-io/etcd/actions/runs/8659974818

image

What did you expect to happen?

Revision doesn't decrease

How can we reproduce it (as minimally and precisely as possible)?

Follow https://github.com/etcd-io/etcd/tree/main/tests/robustness#re-evaluate-existing-report to validate report from https://github.com/etcd-io/etcd/actions/runs/8659974818

Run TestRobustnessExploratory/Kubernetes/LowTraffic/ClusterOfSize1 with failpoint compactBeforeSetFinishedCompact=panic()

Anything else we need to know?

TODO:

  • Try to reproduce the issue
  • Try to reproduce the issue on amd64

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@serathius serathius added type/bug priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 12, 2024
@serathius serathius changed the title [Arm64] Revision decreasing after panic during compaction on main branch [main branch][Arm64] Revision decreasing after panic during compaction on main branch Apr 12, 2024
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
@serathius
Copy link
Member Author

Hmm, still no repro. Are we so unlucky that we hit a bit flip or broken machine?

@serathius serathius changed the title [main branch][Arm64] Revision decreasing after panic during compaction on main branch [main][Arm64] Revision decreasing after panic during compaction on main branch Apr 12, 2024
@siyuanfoundation
Copy link
Contributor

I was not able to repro it on my mac (arm64) with 500 runs and failpoint compactBeforeSetFinishedCompact=panic()

@chaochn47
Copy link
Member

Not be able to download the log

The reproduce attempts on linux amd64 machine is also not successful.

We need a persisted audit log to easily replay the traffic. Turn on LogUnaryInterceptor could be a good start.

@serathius
Copy link
Member Author

Not be able to download the log

What do you mean? I was able to download main-arm64.zip from https://github.com/etcd-io/etcd/actions/runs/8659974818 without any problem.

@serathius
Copy link
Member Author

I think we need to assume that this was a hardware issue. One last thing to confirm. Could someone check the bbolt file from the report? Would be a good sanity check that the revision decrease really happen.

@dims
Copy link
Contributor

dims commented Apr 12, 2024

@serathius i was able to download and have shared it with Chao (took a few retries .. spotty networking)

@chaochn47
Copy link
Member

chaochn47 commented Apr 12, 2024

image

Somehow I cannot download from the github UI.. Retried multiple times.

dev-dsk-chaochn-2c-a26acd76 % unzip logs_22693293820.zip
Archive:  logs_22693293820.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of logs_22693293820.zip or
        logs_22693293820.zip.zip, and cannot find logs_22693293820.zip.ZIP, period.

@chaochn47
Copy link
Member

chaochn47 commented Apr 12, 2024

@serathius Just to confirm, is it something you were looking for?

I did not observe revision number is decreased but there is some gap between them.

scheduledCompactRev was set to 301 though as expected.

dev-dsk-chaochn-2c-a26acd76 % etcd-dump-db iterate-bucket testdata/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/server-TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0/member/snap/db key --decode=true
rev={main:470 sub:0}, value=[key "/registry/pods/default/HcN6F" | val "343" | created 470 | mod 470 | ver 1]
rev={main:469 sub:0}, value=[key "/registry/pods/default/gq7WI" | val "340" | created 132 | mod 469 | ver 3]
rev={main:468 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "337" | created 65 | mod 468 | ver 6]
rev={main:467 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "333" | created 57 | mod 467 | ver 4]
rev={main:466 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "332" | created 101 | mod 466 | ver 5]
rev={main:465 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "330" | created 66 | mod 465 | ver 6]
rev={main:464 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "328" | created 177 | mod 464 | ver 4]
rev={main:463 sub:0}, value=[key "/registry/pods/default/oitZs" | val "327" | created 152 | mod 463 | ver 3]
rev={main:462 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "326" | created 42 | mod 462 | ver 6]
rev={main:461 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "325" | created 36 | mod 461 | ver 6]
rev={main:460 sub:0}, value=[key "/registry/pods/default/loies" | val "324" | created 185 | mod 460 | ver 8]
rev={main:459 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "323" | created 112 | mod 459 | ver 5]
rev={main:458 sub:0}, value=[key "/registry/pods/default/mFTeh" | val "322" | created 458 | mod 458 | ver 1]
rev={main:457 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "321" | created 43 | mod 457 | ver 5]
rev={main:456 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "320" | created 43 | mod 456 | ver 4]
rev={main:455 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "319" | created 66 | mod 455 | ver 5]
rev={main:454 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "318" | created 65 | mod 454 | ver 5]
rev={main:453 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "317" | created 36 | mod 453 | ver 5]
rev={main:452 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "316" | created 112 | mod 452 | ver 4]
rev={main:451 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "315" | created 438 | mod 451 | ver 4]
rev={main:450 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "314" | created 79 | mod 450 | ver 5]
rev={main:449 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "313" | created 36 | mod 449 | ver 4]
rev={main:448 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "312" | created 119 | mod 448 | ver 5]
rev={main:447 sub:0}, value=[key "/registry/pods/default/gq7WI" | val "311" | created 132 | mod 447 | ver 2]
rev={main:446 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "310" | created 438 | mod 446 | ver 3]
rev={main:445 sub:0}, value=[key "/registry/pods/default/loies" | val "309" | created 185 | mod 445 | ver 7]
rev={main:444 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "308" | created 101 | mod 444 | ver 4]
rev={main:443 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "307" | created 42 | mod 443 | ver 5]
rev={main:442 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "306" | created 119 | mod 442 | ver 4]
rev={main:441 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "305" | created 438 | mod 441 | ver 2]
rev={main:440 sub:0}, value=[key "/registry/pods/default/loies" | val "304" | created 185 | mod 440 | ver 6]
rev={main:439 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "303" | created 52 | mod 439 | ver 4]
rev={main:438 sub:0}, value=[key "/registry/pods/default/Kkeuu" | val "302" | created 438 | mod 438 | ver 1]
rev={main:437 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "301" | created 79 | mod 437 | ver 4]
rev={main:436 sub:0}, value=[key "/registry/pods/default/oitZs" | val "300" | created 152 | mod 436 | ver 2]
rev={main:435 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "299" | created 101 | mod 435 | ver 3]
rev={main:434 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "298" | created 36 | mod 434 | ver 3]
rev={main:433 sub:0}, value=[key "/registry/pods/default/loies" | val "297" | created 185 | mod 433 | ver 5]
rev={main:432 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "296" | created 65 | mod 432 | ver 4]
rev={main:431 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "295" | created 52 | mod 431 | ver 3]
rev={main:430 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "294" | created 42 | mod 430 | ver 4]
rev={main:429 sub:0}, value=[key "/registry/pods/default/b4liI" | val "" | created 0 | mod 0 | ver 0]
rev={main:428 sub:0}, value=[key "/registry/pods/default/RDgx0" | val "" | created 0 | mod 0 | ver 0]
rev={main:427 sub:0}, value=[key "/registry/pods/default/pTNFe" | val "" | created 0 | mod 0 | ver 0]
rev={main:426 sub:0}, value=[key "/registry/pods/default/aQ8HK" | val "" | created 0 | mod 0 | ver 0]
rev={main:425 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "" | created 0 | mod 0 | ver 0]
rev={main:424 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "" | created 0 | mod 0 | ver 0]
rev={main:423 sub:0}, value=[key "/registry/pods/default/lmO6M" | val "" | created 0 | mod 0 | ver 0]
rev={main:422 sub:0}, value=[key "/registry/pods/default/ydYO1" | val "" | created 0 | mod 0 | ver 0]
rev={main:421 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "293" | created 43 | mod 421 | ver 3]
rev={main:420 sub:0}, value=[key "/registry/pods/default/ldPwh" | val "" | created 0 | mod 0 | ver 0]
rev={main:419 sub:0}, value=[key "/registry/pods/default/63R6m" | val "" | created 0 | mod 0 | ver 0]
rev={main:418 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "" | created 0 | mod 0 | ver 0]
rev={main:417 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "" | created 0 | mod 0 | ver 0]
rev={main:416 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "" | created 0 | mod 0 | ver 0]
rev={main:415 sub:0}, value=[key "/registry/pods/default/Hz305" | val "" | created 0 | mod 0 | ver 0]
rev={main:414 sub:0}, value=[key "/registry/pods/default/MWrBJ" | val "" | created 0 | mod 0 | ver 0]
rev={main:413 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "" | created 0 | mod 0 | ver 0]
rev={main:412 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "292" | created 77 | mod 412 | ver 3]
rev={main:411 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "" | created 0 | mod 0 | ver 0]
rev={main:410 sub:0}, value=[key "/registry/pods/default/loies" | val "291" | created 185 | mod 410 | ver 4]
rev={main:409 sub:0}, value=[key "/registry/pods/default/zTZIE" | val "" | created 0 | mod 0 | ver 0]
rev={main:408 sub:0}, value=[key "/registry/pods/default/UeZqt" | val "" | created 0 | mod 0 | ver 0]
rev={main:407 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "" | created 0 | mod 0 | ver 0]
rev={main:406 sub:0}, value=[key "/registry/pods/default/65uJD" | val "" | created 0 | mod 0 | ver 0]
rev={main:405 sub:0}, value=[key "/registry/pods/default/1u7se" | val "" | created 0 | mod 0 | ver 0]
rev={main:404 sub:0}, value=[key "/registry/pods/default/PlsfG" | val "" | created 0 | mod 0 | ver 0]
rev={main:403 sub:0}, value=[key "/registry/pods/default/CwjdS" | val "" | created 0 | mod 0 | ver 0]
rev={main:402 sub:0}, value=[key "/registry/pods/default/PFh2o" | val "" | created 0 | mod 0 | ver 0]
rev={main:401 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "" | created 0 | mod 0 | ver 0]
rev={main:400 sub:0}, value=[key "/registry/pods/default/31cCM" | val "" | created 0 | mod 0 | ver 0]
rev={main:399 sub:0}, value=[key "/registry/pods/default/SF9kd" | val "" | created 0 | mod 0 | ver 0]
rev={main:398 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "" | created 0 | mod 0 | ver 0]
rev={main:397 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "" | created 0 | mod 0 | ver 0]
rev={main:396 sub:0}, value=[key "/registry/pods/default/ytwZZ" | val "" | created 0 | mod 0 | ver 0]
rev={main:395 sub:0}, value=[key "/registry/pods/default/obin7" | val "" | created 0 | mod 0 | ver 0]
rev={main:394 sub:0}, value=[key "/registry/pods/default/7a7Gt" | val "" | created 0 | mod 0 | ver 0]
rev={main:393 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "" | created 0 | mod 0 | ver 0]
rev={main:392 sub:0}, value=[key "/registry/pods/default/tzdxB" | val "" | created 0 | mod 0 | ver 0]
rev={main:391 sub:0}, value=[key "/registry/pods/default/eJsR5" | val "" | created 0 | mod 0 | ver 0]
rev={main:390 sub:0}, value=[key "/registry/pods/default/iUgY8" | val "" | created 0 | mod 0 | ver 0]
rev={main:389 sub:0}, value=[key "/registry/pods/default/MiZ8L" | val "" | created 0 | mod 0 | ver 0]
rev={main:388 sub:0}, value=[key "/registry/pods/default/ABFnT" | val "" | created 0 | mod 0 | ver 0]
rev={main:387 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "" | created 0 | mod 0 | ver 0]
rev={main:386 sub:0}, value=[key "/registry/pods/default/loies" | val "290" | created 185 | mod 386 | ver 3]
rev={main:385 sub:0}, value=[key "/registry/pods/default/lthiH" | val "" | created 0 | mod 0 | ver 0]
rev={main:384 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "289" | created 52 | mod 384 | ver 2]
rev={main:383 sub:0}, value=[key "/registry/pods/default/ABFnT" | val "288" | created 107 | mod 383 | ver 2]
rev={main:382 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "287" | created 177 | mod 382 | ver 3]
rev={main:381 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "286" | created 79 | mod 381 | ver 3]
rev={main:380 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "285" | created 90 | mod 380 | ver 3]
rev={main:379 sub:0}, value=[key "/registry/pods/default/5k2u5" | val "" | created 0 | mod 0 | ver 0]
rev={main:378 sub:0}, value=[key "/registry/pods/default/b4liI" | val "283" | created 91 | mod 378 | ver 4]
rev={main:377 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "282" | created 66 | mod 377 | ver 4]
rev={main:376 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "281" | created 98 | mod 376 | ver 3]
rev={main:375 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "280" | created 82 | mod 375 | ver 3]
rev={main:374 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "279" | created 57 | mod 374 | ver 3]
rev={main:373 sub:0}, value=[key "/registry/pods/default/63R6m" | val "278" | created 93 | mod 373 | ver 3]
rev={main:372 sub:0}, value=[key "/registry/pods/default/F9XSE" | val "" | created 0 | mod 0 | ver 0]
rev={main:371 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "277" | created 42 | mod 371 | ver 3]
rev={main:370 sub:0}, value=[key "/registry/pods/default/31cCM" | val "276" | created 370 | mod 370 | ver 1]
rev={main:369 sub:0}, value=[key "/registry/pods/default/lthiH" | val "275" | created 178 | mod 369 | ver 3]
rev={main:368 sub:0}, value=[key "/registry/pods/default/PlsfG" | val "274" | created 327 | mod 368 | ver 2]
rev={main:367 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "273" | created 27 | mod 367 | ver 3]
rev={main:366 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "272" | created 79 | mod 366 | ver 2]
rev={main:365 sub:0}, value=[key "/registry/pods/default/65uJD" | val "271" | created 136 | mod 365 | ver 6]
rev={main:364 sub:0}, value=[key "/registry/pods/default/jQJaP" | val "" | created 0 | mod 0 | ver 0]
rev={main:363 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "269" | created 119 | mod 363 | ver 3]
rev={main:362 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "268" | created 101 | mod 362 | ver 2]
rev={main:361 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "267" | created 66 | mod 361 | ver 3]
rev={main:360 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "266" | created 33 | mod 360 | ver 3]
rev={main:359 sub:0}, value=[key "/registry/pods/default/obin7" | val "265" | created 141 | mod 359 | ver 3]
rev={main:358 sub:0}, value=[key "/registry/pods/default/xtr4a" | val "" | created 0 | mod 0 | ver 0]
rev={main:357 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "263" | created 189 | mod 357 | ver 3]
rev={main:356 sub:0}, value=[key "/registry/pods/default/MWrBJ" | val "262" | created 129 | mod 356 | ver 2]
rev={main:355 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "261" | created 65 | mod 355 | ver 3]
rev={main:354 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "259" | created 16 | mod 354 | ver 3]
rev={main:353 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "260" | created 121 | mod 353 | ver 4]
rev={main:352 sub:0}, value=[key "/registry/pods/default/F9XSE" | val "258" | created 20 | mod 352 | ver 2]
rev={main:351 sub:0}, value=[key "/registry/pods/default/nyn3R" | val "" | created 0 | mod 0 | ver 0]
rev={main:350 sub:0}, value=[key "/registry/pods/default/1u7se" | val "257" | created 191 | mod 350 | ver 3]
rev={main:349 sub:0}, value=[key "/registry/pods/default/b4liI" | val "256" | created 91 | mod 349 | ver 3]
rev={main:348 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "255" | created 116 | mod 348 | ver 3]
rev={main:347 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "254" | created 112 | mod 347 | ver 3]
rev={main:346 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "253" | created 42 | mod 346 | ver 2]
rev={main:345 sub:0}, value=[key "/registry/pods/default/65uJD" | val "252" | created 136 | mod 345 | ver 5]
rev={main:344 sub:0}, value=[key "/registry/pods/default/RDgx0" | val "251" | created 28 | mod 344 | ver 2]
rev={main:343 sub:0}, value=[key "/registry/pods/default/oNdAT" | val "" | created 0 | mod 0 | ver 0]
rev={main:342 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "249" | created 65 | mod 342 | ver 2]
rev={main:341 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "248" | created 16 | mod 341 | ver 2]
rev={main:340 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "247" | created 27 | mod 340 | ver 2]
rev={main:339 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "246" | created 140 | mod 339 | ver 3]
rev={main:338 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "245" | created 90 | mod 338 | ver 2]
rev={main:337 sub:0}, value=[key "/registry/pods/default/1u7se" | val "244" | created 191 | mod 337 | ver 2]
rev={main:336 sub:0}, value=[key "/registry/pods/default/7vXeA" | val "" | created 0 | mod 0 | ver 0]
rev={main:335 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "243" | created 121 | mod 335 | ver 3]
rev={main:334 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "242" | created 36 | mod 334 | ver 2]
rev={main:333 sub:0}, value=[key "/registry/pods/default/iUgY8" | val "240" | created 63 | mod 333 | ver 2]
rev={main:332 sub:0}, value=[key "/registry/pods/default/65uJD" | val "241" | created 136 | mod 332 | ver 4]
rev={main:331 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "239" | created 116 | mod 331 | ver 2]
rev={main:330 sub:0}, value=[key "/registry/pods/default/ldPwh" | val "238" | created 106 | mod 330 | ver 2]
rev={main:329 sub:0}, value=[key "/registry/pods/default/tzdxB" | val "237" | created 47 | mod 329 | ver 2]
rev={main:328 sub:0}, value=[key "/registry/pods/default/WVd2B" | val "" | created 0 | mod 0 | ver 0]
rev={main:327 sub:0}, value=[key "/registry/pods/default/PlsfG" | val "236" | created 327 | mod 327 | ver 1]
rev={main:326 sub:0}, value=[key "/registry/pods/default/loies" | val "234" | created 185 | mod 326 | ver 2]
rev={main:325 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "233" | created 43 | mod 325 | ver 2]
rev={main:324 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "231" | created 189 | mod 324 | ver 2]
rev={main:323 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "230" | created 57 | mod 323 | ver 2]
rev={main:322 sub:0}, value=[key "/registry/pods/default/JtjI4" | val "" | created 0 | mod 0 | ver 0]
rev={main:321 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "229" | created 66 | mod 321 | ver 2]
rev={main:320 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "228" | created 98 | mod 320 | ver 2]
rev={main:319 sub:0}, value=[key "/registry/pods/default/b4liI" | val "227" | created 91 | mod 319 | ver 2]
rev={main:318 sub:0}, value=[key "/registry/pods/default/zTZIE" | val "226" | created 138 | mod 318 | ver 2]
rev={main:317 sub:0}, value=[key "/registry/pods/default/63R6m" | val "225" | created 93 | mod 317 | ver 2]
rev={main:316 sub:0}, value=[key "/registry/pods/default/eJsR5" | val "224" | created 78 | mod 316 | ver 2]
rev={main:315 sub:0}, value=[key "/registry/pods/default/h8XO2" | val "" | created 0 | mod 0 | ver 0]
rev={main:314 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "223" | created 256 | mod 314 | ver 2]
rev={main:313 sub:0}, value=[key "/registry/pods/default/lthiH" | val "222" | created 178 | mod 313 | ver 2]
rev={main:312 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "221" | created 33 | mod 312 | ver 2]
rev={main:311 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "219" | created 177 | mod 311 | ver 2]
rev={main:310 sub:0}, value=[key "/registry/pods/default/sEkjd" | val "" | created 0 | mod 0 | ver 0]
rev={main:309 sub:0}, value=[key "/registry/pods/default/65uJD" | val "218" | created 136 | mod 309 | ver 3]
rev={main:308 sub:0}, value=[key "/registry/pods/default/aQ8HK" | val "217" | created 83 | mod 308 | ver 2]
rev={main:307 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "216" | created 119 | mod 307 | ver 2]
rev={main:306 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "215" | created 251 | mod 306 | ver 3]
rev={main:305 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "213" | created 82 | mod 305 | ver 2]
rev={main:304 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "212" | created 121 | mod 304 | ver 2]
rev={main:303 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "210" | created 77 | mod 303 | ver 2]
rev={main:302 sub:0}, value=[key "/registry/pods/default/PFh2o" | val "211" | created 55 | mod 302 | ver 2]
rev={main:301 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "207" | created 112 | mod 301 | ver 2]
rev={main:300 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "209" | created 140 | mod 300 | ver 2]
rev={main:299 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "208" | created 251 | mod 299 | ver 2]
rev={main:298 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "" | created 0 | mod 0 | ver 0]
rev={main:297 sub:0}, value=[key "/registry/pods/default/obin7" | val "200" | created 141 | mod 297 | ver 2]
rev={main:262 sub:0}, value=[key "/registry/pods/default/65uJD" | val "199" | created 136 | mod 262 | ver 2]
rev={main:256 sub:0}, value=[key "/registry/pods/default/aI3l7" | val "198" | created 256 | mod 256 | ver 1]
rev={main:251 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "197" | created 251 | mod 251 | ver 1]
rev={main:191 sub:0}, value=[key "/registry/pods/default/1u7se" | val "196" | created 191 | mod 191 | ver 1]
rev={main:190 sub:0}, value=[key "/registry/pods/default/7vXeA" | val "195" | created 190 | mod 190 | ver 1]
rev={main:189 sub:0}, value=[key "/registry/pods/default/DCo9K" | val "194" | created 189 | mod 189 | ver 1]
rev={main:187 sub:0}, value=[key "/registry/pods/default/oNdAT" | val "193" | created 187 | mod 187 | ver 1]
rev={main:185 sub:0}, value=[key "/registry/pods/default/loies" | val "189" | created 185 | mod 185 | ver 1]
rev={main:180 sub:0}, value=[key "/registry/pods/default/lmO6M" | val "183" | created 180 | mod 180 | ver 1]
rev={main:179 sub:0}, value=[key "/registry/pods/default/h8XO2" | val "182" | created 179 | mod 179 | ver 1]
rev={main:178 sub:0}, value=[key "/registry/pods/default/lthiH" | val "181" | created 178 | mod 178 | ver 1]
rev={main:177 sub:0}, value=[key "/registry/pods/default/wzDjI" | val "180" | created 177 | mod 177 | ver 1]
rev={main:171 sub:0}, value=[key "/registry/pods/default/Hz305" | val "173" | created 171 | mod 171 | ver 1]
rev={main:165 sub:0}, value=[key "/registry/pods/default/UeZqt" | val "167" | created 165 | mod 165 | ver 1]
rev={main:152 sub:0}, value=[key "/registry/pods/default/oitZs" | val "153" | created 152 | mod 152 | ver 1]
rev={main:143 sub:0}, value=[key "/registry/pods/default/sEkjd" | val "144" | created 143 | mod 143 | ver 1]
rev={main:140 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "140" | created 140 | mod 140 | ver 1]
rev={main:139 sub:0}, value=[key "/registry/pods/default/7a7Gt" | val "139" | created 139 | mod 139 | ver 1]
rev={main:138 sub:0}, value=[key "/registry/pods/default/zTZIE" | val "138" | created 138 | mod 138 | ver 1]
rev={main:132 sub:0}, value=[key "/registry/pods/default/gq7WI" | val "131" | created 132 | mod 132 | ver 1]
rev={main:130 sub:0}, value=[key "/registry/pods/default/ydYO1" | val "128" | created 130 | mod 130 | ver 1]
rev={main:129 sub:0}, value=[key "/registry/pods/default/MWrBJ" | val "127" | created 129 | mod 129 | ver 1]
rev={main:125 sub:0}, value=[key "/registry/pods/default/xtr4a" | val "117" | created 125 | mod 125 | ver 1]
rev={main:121 sub:0}, value=[key "/registry/pods/default/jwsRC" | val "123" | created 121 | mod 121 | ver 1]
rev={main:119 sub:0}, value=[key "/registry/pods/default/5u0i1" | val "122" | created 119 | mod 119 | ver 1]
rev={main:116 sub:0}, value=[key "/registry/pods/default/bjk4r" | val "114" | created 116 | mod 116 | ver 1]
rev={main:114 sub:0}, value=[key "/registry/pods/default/jQJaP" | val "112" | created 114 | mod 114 | ver 1]
rev={main:112 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "110" | created 112 | mod 112 | ver 1]
rev={main:107 sub:0}, value=[key "/registry/pods/default/ABFnT" | val "105" | created 107 | mod 107 | ver 1]
rev={main:106 sub:0}, value=[key "/registry/pods/default/ldPwh" | val "104" | created 106 | mod 106 | ver 1]
rev={main:102 sub:0}, value=[key "/registry/pods/default/SF9kd" | val "99" | created 102 | mod 102 | ver 1]
rev={main:101 sub:0}, value=[key "/registry/pods/default/nrz0G" | val "98" | created 101 | mod 101 | ver 1]
rev={main:100 sub:0}, value=[key "/registry/pods/default/MiZ8L" | val "97" | created 100 | mod 100 | ver 1]
rev={main:98 sub:0}, value=[key "/registry/pods/default/1bLqv" | val "95" | created 98 | mod 98 | ver 1]
rev={main:94 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "91" | created 94 | mod 94 | ver 1]
rev={main:93 sub:0}, value=[key "/registry/pods/default/63R6m" | val "86" | created 93 | mod 93 | ver 1]
rev={main:91 sub:0}, value=[key "/registry/pods/default/b4liI" | val "87" | created 91 | mod 91 | ver 1]
rev={main:90 sub:0}, value=[key "/registry/pods/default/jy2Ea" | val "89" | created 90 | mod 90 | ver 1]
rev={main:87 sub:0}, value=[key "/registry/pods/default/JtjI4" | val "84" | created 87 | mod 87 | ver 1]
rev={main:83 sub:0}, value=[key "/registry/pods/default/aQ8HK" | val "80" | created 83 | mod 83 | ver 1]
rev={main:82 sub:0}, value=[key "/registry/pods/default/rU6I5" | val "79" | created 82 | mod 82 | ver 1]
rev={main:79 sub:0}, value=[key "/registry/pods/default/EpsWz" | val "76" | created 79 | mod 79 | ver 1]
rev={main:78 sub:0}, value=[key "/registry/pods/default/eJsR5" | val "75" | created 78 | mod 78 | ver 1]
rev={main:77 sub:0}, value=[key "/registry/pods/default/M9uIW" | val "73" | created 77 | mod 77 | ver 1]
rev={main:75 sub:0}, value=[key "/registry/pods/default/pTNFe" | val "71" | created 75 | mod 75 | ver 1]
rev={main:72 sub:0}, value=[key "/registry/pods/default/CwjdS" | val "68" | created 72 | mod 72 | ver 1]
rev={main:66 sub:0}, value=[key "/registry/pods/default/Y9W7o" | val "62" | created 66 | mod 66 | ver 1]
rev={main:65 sub:0}, value=[key "/registry/pods/default/Q4wBI" | val "61" | created 65 | mod 65 | ver 1]
rev={main:63 sub:0}, value=[key "/registry/pods/default/iUgY8" | val "59" | created 63 | mod 63 | ver 1]
rev={main:57 sub:0}, value=[key "/registry/pods/default/UhvUo" | val "54" | created 57 | mod 57 | ver 1]
rev={main:55 sub:0}, value=[key "/registry/pods/default/PFh2o" | val "52" | created 55 | mod 55 | ver 1]
rev={main:52 sub:0}, value=[key "/registry/pods/default/cGyPD" | val "50" | created 52 | mod 52 | ver 1]
rev={main:47 sub:0}, value=[key "/registry/pods/default/tzdxB" | val "45" | created 47 | mod 47 | ver 1]
rev={main:43 sub:0}, value=[key "/registry/pods/default/w3EyG" | val "42" | created 43 | mod 43 | ver 1]
rev={main:42 sub:0}, value=[key "/registry/pods/default/4JCLw" | val "41" | created 42 | mod 42 | ver 1]
rev={main:40 sub:0}, value=[key "/registry/pods/default/nyn3R" | val "39" | created 14 | mod 40 | ver 2]
rev={main:38 sub:0}, value=[key "/registry/pods/default/5k2u5" | val "16" | created 38 | mod 38 | ver 1]
rev={main:37 sub:0}, value=[key "/registry/pods/default/ytwZZ" | val "37" | created 37 | mod 37 | ver 1]
rev={main:36 sub:0}, value=[key "/registry/pods/default/OQVJw" | val "36" | created 36 | mod 36 | ver 1]
rev={main:33 sub:0}, value=[key "/registry/pods/default/gm4gQ" | val "33" | created 33 | mod 33 | ver 1]
rev={main:28 sub:0}, value=[key "/registry/pods/default/RDgx0" | val "29" | created 28 | mod 28 | ver 1]
rev={main:27 sub:0}, value=[key "/registry/pods/default/TYdSY" | val "28" | created 27 | mod 27 | ver 1]
rev={main:23 sub:0}, value=[key "/registry/pods/default/WVd2B" | val "24" | created 23 | mod 23 | ver 1]
rev={main:20 sub:0}, value=[key "/registry/pods/default/F9XSE" | val "21" | created 20 | mod 20 | ver 1]
rev={main:16 sub:0}, value=[key "/registry/pods/default/Q9JZ7" | val "17" | created 16 | mod 16 | ver 1]
dev-dsk-chaochn-2c-a26acd76 % etcd-dump-db iterate-bucket testdata/TestRobustnessExploratory_Kubernetes_LowTraffic_ClusterOfSize1/server-TestRobustnessExploratoryKubernetesLowTrafficClusterOfSize1-test-0/member/snap/db meta --decode=true
key="term", value=3
key="storageVersion", value="3.6.0"
key="scheduledCompactRev", value={301 0}
key="consistent_index", value=539
key="confState", value="{\"voters\":[14578408409545168728],\"auto_leave\":false}"

@dims
Copy link
Contributor

dims commented Apr 12, 2024

spoke too soon, multiple laptops, different browsers, same issue (zip corrupted)

@jmhbnz jmhbnz added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Apr 12, 2024
@serathius
Copy link
Member Author

serathius commented Apr 12, 2024

Hmm, last 4 operations before crash were deletes. Defrag removes revisions in Key bucket for keys that were deleted. Etcd infers the last revision based on Key bucket. After restart etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr

@serathius
Copy link
Member Author

serathius commented Apr 12, 2024

Will try to increase number of deletes to reproduce a similar case. Looking at the results from couple of runs, the deletes are pretty rare in Kubernetes traffic, possibly request type picker is broken.

serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
serathius added a commit to serathius/etcd that referenced this issue Apr 12, 2024
Signed-off-by: Marek Siarkowicz <[email protected]>
@chaochn47
Copy link
Member

chaochn47 commented Apr 13, 2024

Hmm, last 4 operations before crash were deletes. Defrag removes revisions in Key bucket for keys that were deleted. Etcd infers the last revision based on Key bucket. After restart etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr

Trying to interpret the statement, the difference between the db file key bucket layout and the recorded response is due to compact & defrag after resuming from the panic, right?

But it does not make sense deletion returns response with revision 298 with key /regstry/pods/default/o95Cz but in the db file 298 corresponds to /registry/pods/default/vZJa5. Either the db file is wrong or the etcd responds with incorrect revision number to client.

297 is correct and it is reflected as put in the db file.

rev={main:301 sub:0}, value=[key "/registry/pods/default/HhKkd" | val "207" | created 112 | mod 301 | ver 2]
rev={main:300 sub:0}, value=[key "/registry/pods/default/E7Zuy" | val "209" | created 140 | mod 300 | ver 2]
rev={main:299 sub:0}, value=[key "/registry/pods/default/5Dtb4" | val "208" | created 251 | mod 299 | ver 2]
rev={main:298 sub:0}, value=[key "/registry/pods/default/vZJa5" | val "" | created 0 | mod 0 | ver 0]
rev={main:297 sub:0}, value=[key "/registry/pods/default/obin7" | val "200" | created 141 | mod 297 | ver 2]
if(mod_rev(/registry/pods/default/NYpLu)==30).
  then(delete("/registry/pods/default/NYpLu")).
  else(get("/registry/pods/default/NYpLu")) -> success(deleted: 1), rev:296

if(mod_rev(/registry/pods/default/obin7)==141).
  then(put("/registry/pods/default/obin7", "200")).
  else(get("/registry/pods/default/obin7")) -> success(ok), rev: 297

if(mod_rev(/regstry/pods/default/o95Cz)==133).
  then(delete("/registry/pods/default/o95Cz")).
  else(get("/registry/pods/default/o95Cz")) -> success(deleted: 1), rev: 298

if(mod_rev(/registry/pods/default/OWSx2)==137).
  then(delete("/registry/pods/default/OWSx2")).
  else(get("/registry/pods/default/OWSx2")) -> success(deleted: 1), rev: 299

if(mod_rev(/registry/pods/default/V8bKj)==53).
  then(delete("/registry/pods/default/V8bKj")).
  else(get("/registry/pods/default/V8bKj")) -> success(deleted: 1), rev: 300

if(mod_rev(/registry/pods/default/Z68rD)==156).
  then(delete("/registry/pods/default/Z68rD")).
  else(get("/registry/pods/default/Z68rD")) -> success(deleted: 1), rev: 301

Compact -> `scheduledCompactRev` was set to `301`

@ahrtr
Copy link
Member

ahrtr commented Apr 13, 2024

Hmm, last 4 operations before crash were deletes. Defrag removes revisions in Key bucket for keys that were deleted. Etcd infers the last revision based on Key bucket. After restart etcd went back as far as the last put operation. I have a bad feeling about this. cc @ahrtr

The workflow doesn't have any problem. Compact + defragmentation won't remove the last revision, even it's a tombstone revision.

But it does not make sense deletion returns response with revision 298 with key /regstry/pods/default/o95Cz but in the db file 298 corresponds to /registry/pods/default/vZJa5. Either the db file is wrong or the etcd responds with incorrect revision number to client.

The bbolt db file should be correct.

It looks like that the report isn't consistent with regards to revision 298. So it's a bug of the test itself?

  • In client-1/watch.json
{"Events":[{"Type":"delete-operation","Key":"/registry/pods/default/o95Cz","Value":{"Value":"","Hash":0},"Revision":298,"IsCreate":false,"PrevValue":null}],"IsProgressNotify":false,"Revision":298,"Time":919365280},
  • In client-3/watch.json
{"Request":{"Key":"/registry/pods/","Revision":298,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true},"Responses":[{"Events":[{"Type":"delete-operation","Key":"/registry/pods/default/vZJa5","Value":{"Value":"","Hash":0},"Revision":298,"IsCreate":false,"PrevValue":{"Value":{"Value":"91","Hash":0},"ModR    evision":94}}],"IsProgressNotify":false,"Revision":298,"Time":2073785080}
  • In client-4/watch.json
"Request":{"Key":"/registry/pods/","Revision":298,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true},"Responses":[{"Events":[{"Type":"delete-operation","Key":"/registry/pods/default/vZJa5","Value":{"Value":"","Hash":0},"Revision":298,"IsCreate":false,"PrevValue":{"Value":{"Value":"91","Hash":0},"ModRevision":94}}

@serathius
Copy link
Member Author

With #17815 this issue have been confirmed and reproduced on all supported release branches.

@ahrtr
Copy link
Member

ahrtr commented Apr 17, 2024

Root cause

Based on my discussion with @fuweid today, for anyone reference on the root cause of this issue, see below summary.

  • When latest revision is a tombstone revision, compacting the latest revision will remove the latest revision. It causes the revision decreases in bbolt data file.
  • If etcd crashes right before it persists the finished compaction revision. Then the finished compact revision isn't persisted to bbolt
    UnsafeSetFinishedCompact(tx, compactMainRev)
  • When etcd gets started again, it loads all keys and gets the latest revision (already decreased revision) from bbolt. Usually it isn't a problem, because etcd will correct the revision based on the finished compaction revision. It's exactly the reason why previous I did not reproduce it when I raised the comment Revision decreasing after panic during compaction #17780 (comment). But if the finished compact revision hasn't been persisted, then there is no way for etcd to correct the revision.
    s.compactMainRev = finishedCompact

    if s.currentRev < s.compactMainRev {
    s.currentRev = s.compactMainRev
    }

Versions affected

All versions (3.4.x, 3.5.x, main) have this issue.

For single node cluster, the symptom is the revision decreases.

For multi node cluster, the symptom is not only the revision decreases, but also inconsistent revisions across the etcd cluster.

Note the key/value data is still consistent when this issue is reproduced.

Hard to reproduce

The good news is that this issue should be very hard to reproduce in production environment, because It can only be reproduced when all the following conditions are true,

  • Compact the latest revision;
  • The latest revision is a tombstone (a deletion);
  • etcd crashes after it removes the latest tombstone revision and before it persists the finishedCompactRevision. Obviously it's very small window.

Solution

One proposed solution: #17815 (comment)

Another solution is updating currentRev using the scheduledCompactRevision on bootstrap. See

Workaround

Once it's reproduced, we can use the bump-revision to manually bump the revision to make all etcd instances have consistent revision.

@serathius
Copy link
Member Author

With the issue confirmed we need to do impact assessment, check which version it affects, how often it can happen, can it cause inconsistency in multi member cluster and if it can happen in Kubernetes.

The good news is that this issue should be very hard to reproduce in production environment.

Can you provide some context why it should very hard to reproduce, so for everyone can follow? Is it just based on small window of crash vulnerability due to infrequency of compact? How that probability looks for Kubernetes?

Fact that last revision needs to be a tombstone, reduces the chances, but would be good to confirm what percentage do deletes constitute.

@ahrtr
Copy link
Member

ahrtr commented Apr 17, 2024

The good news is that this issue should be very hard to reproduce in production environment.

Can you provide some context why it should very hard to reproduce, so for everyone can follow?

Sorry for the confusion. I should be more clearer. Just updated my previous comment.

@serathius
Copy link
Member Author

serathius commented Apr 18, 2024

Sorry for the confusion. I should be more clearer. Just updated my previous comment.

Thanks, looks great!

Compact the latest revision;

This makes sense, compacting last revision is not expected behavior from most users. Kubernetes built in compaction should almost never do that. "Almost" comes from cases where there were no writes at all for 5 minutes, which is unexpected due to Kubernetes Node Lease and Leader election writing periodically.

fuweid added a commit to fuweid/etcd that referenced this issue Apr 18, 2024
fuweid added a commit to fuweid/etcd that referenced this issue Apr 18, 2024
fuweid added a commit to fuweid/etcd that referenced this issue Apr 18, 2024
fuweid added a commit to fuweid/etcd that referenced this issue Apr 19, 2024
fuweid added a commit to fuweid/etcd that referenced this issue Apr 21, 2024
@ahrtr
Copy link
Member

ahrtr commented Apr 23, 2024

@fuweid I am going to release etcd 3.4.32 together with @spzala tomorrow night, can you please #17815 to 3.5 and 3.4 ? Please feel free to let me know if you don't have bandwidth. Thanks

@fuweid
Copy link
Member

fuweid commented Apr 24, 2024

fuweid added a commit to fuweid/etcd that referenced this issue Apr 24, 2024
Signed-off-by: Wei Fu <[email protected]>
(cherry picked from commit 7173391)
Signed-off-by: Wei Fu <[email protected]>
fuweid added a commit to fuweid/etcd that referenced this issue Apr 24, 2024
Signed-off-by: Wei Fu <[email protected]>
(cherry picked from commit 7173391)
Signed-off-by: Wei Fu <[email protected]>
@ahrtr
Copy link
Member

ahrtr commented Apr 24, 2024

@fuweid Thanks. Please also update the changelog for 3.4 and 3.5.

fuweid added a commit to fuweid/etcd that referenced this issue Apr 24, 2024
Fix: Revision decreasing after panic during compaction - etcd-io#17780

Signed-off-by: Wei Fu <[email protected]>
Scotchman0 pushed a commit to Scotchman0/etcd that referenced this issue May 7, 2024
Fix: Revision decreasing after panic during compaction - etcd-io#17780

Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Will Russell <[email protected]>
@serathius
Copy link
Member Author

Keeping it open until we cover it with robustness test.

@fuweid
Copy link
Member

fuweid commented May 27, 2024

/assign

@serathius
Copy link
Member Author

Let's close it for now.

fuweid added a commit to fuweid/etcd that referenced this issue Nov 11, 2024
fuweid added a commit to fuweid/etcd that referenced this issue Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.

7 participants