etcd crash and cannot be recovered after "mvcc: range cannot find rev" #10010
Comments
@yingnanzhang666 Is there any way that you can share the db files?
@gyuho It's our production environment, so I cannot share them. Which information do you need? I can collect it for you.
Can you provide more details? So, no matter how far you go back with previous snapshots, etcd now panics right away?
I will provide redact tooling (#7620), so that the db files can be shared.
After restore, it can run for several minutes; the log is as below, and the last log line is "mvcc: range cannot find rev (1716401,0)". Then it keeps restarting, but panics immediately each time.
I restored the backups starting from the most recent one and worked backward. Restoring the latest several behaves as in my last comment. Going further back to restore other backups, I got two other kinds of panic, "panic: invalid page type: 1938: 10" and "panic: page 1938 already freed", as mentioned in the issue description, and etcd panicked immediately after being brought up. Only when I went back to the backup shown below was etcd recovered with no panic.
The redact tooling is pretty cool. When can I use it?
@yingnanzhang666 No timeline yet, but we will add that feature asap. Could you keep the "first" db file that panics after "snapshot restore", so that it can be reproduced on our side once we have the redact tool?
cc @jpbetz
@yingnanzhang666 You've encountered #8813, which exists on 3.1.9. The fix was backported via #8902 to 3.1.11. "page 1938 already freed" type errors are the most common symptom of this corruption bug. It can happen both in HA clusters, when members recover from peers, and when recovering from a backup snapshot file.
Before restoring from a snapshot, the cluster was running normally with a single etcd member, but got the error "mvcc: range cannot find rev (224,0)". After etcd panicked and restarted with the same data, it could not start up, failing with "panic: runtime error: slice bounds out of range" during restore.
This shows the keys are not correct: the key length is less than the expected 8+1+8 bytes.
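To illustrate why that panic points at key corruption: mvcc revision keys are laid out as 8 bytes of main revision, a '_' separator byte, and 8 bytes of sub revision. The following is only a minimal sketch of that layout (not etcd's actual source); a key shorter than 8+1+8 bytes makes the slice expressions panic with "slice bounds out of range":

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeRevision is an illustrative re-implementation of the revision-key
// layout described above: 8 bytes main revision, one '_' separator,
// 8 bytes sub revision (8+1+8 = 17 bytes total).
func decodeRevision(key []byte) (main, sub int64) {
	// A corrupted key shorter than 17 bytes makes these slice
	// expressions panic with "slice bounds out of range".
	main = int64(binary.BigEndian.Uint64(key[0:8]))
	sub = int64(binary.BigEndian.Uint64(key[9:17]))
	return main, sub
}

func main() {
	key := make([]byte, 17)
	binary.BigEndian.PutUint64(key[0:8], 224)
	key[8] = '_'
	binary.BigEndian.PutUint64(key[9:17], 0)
	fmt.Println(decodeRevision(key)) // 224 0
}
```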
@jpbetz Regarding the issue fixed by PR etcd-io/bbolt#67, did it exist in v3.1.9? The changed code doesn't exist in v3.1.9. I also added the unit test case "func TestDB_Concurrent_WriteTo(t *testing.T)" to v3.1.9, and it passes.
Testing v3.1.11 without etcd-io/bbolt#67, the unit test shows
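For anyone trying to reproduce this, here is a rough sketch of what such a concurrent-WriteTo exercise looks like against the public bolt API (this is not the upstream test itself; the import path, file names, and sizes are assumptions): one goroutine keeps committing writes while others copy the db via Tx.WriteTo, and each resulting copy can then be checked for consistency.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"sync"

	bolt "go.etcd.io/bbolt" // assumed import path; github.com/boltdb/bolt exposes the same API
)

func main() {
	db, err := bolt.Open("test.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Copiers: snapshot the db with Tx.WriteTo while writes are in flight.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			f, err := os.Create(fmt.Sprintf("copy-%d.db", i))
			if err != nil {
				log.Println(err)
				return
			}
			defer f.Close()
			if err := db.View(func(tx *bolt.Tx) error {
				_, werr := tx.WriteTo(f)
				return werr
			}); err != nil {
				log.Println(err)
			}
		}(i)
	}

	// Writer: keep committing so the copies race with live transactions.
	for j := 0; j < 1000; j++ {
		if err := db.Update(func(tx *bolt.Tx) error {
			b, berr := tx.CreateBucketIfNotExists([]byte("bench"))
			if berr != nil {
				return berr
			}
			return b.Put([]byte(fmt.Sprintf("k-%d", j)), make([]byte, 1024))
		}); err != nil {
			log.Fatal(err)
		}
	}
	wg.Wait()
	// Each copy-*.db can then be inspected with `bolt check`.
}
```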
@jpbetz Also, from the comments in #8813 (comment), it only impacts the snapshot, not the db, right? So if it's a single etcd member that has never been recovered from backup, this freelist corruption issue shouldn't occur. But in fact, when it got the first error "mvcc: range cannot find rev (224,0)" and crashed, it was a single etcd member that had never restarted or been recovered since it started from scratch.
meta page
root page
The key page is the same as the root page: there are multiple references to page 1938.
freelist page
There are two 1977 entries in this freelist, so I think perhaps this issue is not the same as etcd-io/bbolt#67.
@yingnanzhang666 my understanding is that the issue exists on etcd versions prior to 3.1.11. The safest bet would be to bump up your etcd version. I’m not 100% certain why the test passed, but it’s not written to be deterministic, so maybe that’s at play?
There are two ways we know it can happen: (1) when an etcd server becomes sufficiently out of date that it is sent a new copy of the db from the leader; this happens automatically. (2) when creating a snapshot file explicitly, e.g. via etcdctl. So maybe you’ve encountered it in a way we haven’t seen before? Everything about the details you’ve provided strongly suggests that this is the same freelist corruption issue.
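For case (2), a snapshot file can also be produced programmatically with the clientv3 Maintenance API, equivalent to `etcdctl snapshot save`. A minimal sketch, assuming a local endpoint and an output file name that are only placeholders (the clientv3 import path differs between etcd releases):

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	"go.etcd.io/etcd/clientv3" // import path varies across etcd releases
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Maintenance.Snapshot streams a copy of the backend db,
	// which is then written out as a snapshot file.
	rc, err := cli.Snapshot(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	f, err := os.Create("backup.db") // placeholder file name
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, rc); err != nil {
		log.Fatal(err)
	}
}
```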
@jpbetz There's a
I am afraid the issue persists in 3.1.18,
using
to see what happened, I added several log lines before and after the log lines I added
and reexecute
as @yingnanzhang666 stated in #10010 (comment)
I have to
@yuchengwu, do you still have the broken db file? Could you run 'bolt check db' and verify this is the same bug as originally reported in this issue? 'bolt' is a command line utility: https://github.com/boltdb/bolt#installing
@jingyih yes, sure.
This is 'bolt_pages.txt', and this is 'bolt_stats.txt'.
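For anyone following along without the `bolt` CLI installed, a similar consistency check can also be run from Go via the public Tx.Check API. A minimal sketch, assuming a copied db file name that is only a placeholder:

```go
package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt" // assumed import path; github.com/boltdb/bolt has the same API
)

func main() {
	// Open the copied db read-only so the check cannot modify it.
	db, err := bolt.Open("db-copy.db", 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.View(func(tx *bolt.Tx) error {
		// Tx.Check walks the pages and reports problems such as
		// "page already freed" or invalid page types.
		for cerr := range tx.Check() {
			fmt.Println(cerr)
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}
```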
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
etcd version: 3.1.9
kubernetes version: 1.9
etcd panicked after getting the error message "mvcc: range cannot find rev (224,0)", as below.
When it restarted and restored, compacting to a certain revision, it kept panicking with "panic: runtime error: slice bounds out of range".
When restoring etcd from some previous backups, we got two other kinds of panic, "panic: invalid page type: 1938: 10" and "panic: page 1938 already freed", as below.
Checked the snapshots: only the backups taken before that one pass bolt check, and only with those can etcd be recovered to a healthy cluster. The bad one shows "page 1938: already freed".
Checked the db in the etcd data-dir; it shows "stack overflowpage".