Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong raft messages may cause etcd panic #17081

Closed
4 tasks done
overvenus opened this issue Dec 7, 2023 · 6 comments · Fixed by #17078
Closed
4 tasks done

Wrong raft messages may cause etcd panic #17081

overvenus opened this issue Dec 7, 2023 · 6 comments · Fixed by #17078
Assignees

Comments

@overvenus
Copy link
Contributor

Bug report criteria

What happened?

Multiple etcd servers repeatedly panic with a message "tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"

The first panic occurred one month after the etcd started running. At that time, there were some network issues, leading to TCP reconnections.

What did you expect to happen?

No panic.

How can we reproduce it (as minimally and precisely as possible)?

It's hard to reproduce the panic in real world, but we have reproduced it in a unit test.

Anything else we need to know?

After investigation, we suspect that a malfunctioning switch or a similar issue may be dispatching incorrect Raft messages, leading to the panic. Currently, etcd does not check the To field in messages.

Etcd version (please run commands below)

We (tikv/pd) use embed etcd v3.4.21 85b640cee793

Etcd configuration (command line flags or environment variables)

Default settings.

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

[2023/10/30 14:48:34.402 +08:00] [WARN] [stream.go:266] ["closed TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.402 +08:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.409 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.409 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=1baed0b840546f7]
[2023/10/30 14:48:34.410 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=18cf5532a87c6527]
[2023/10/30 14:48:34.410 +08:00] [INFO] [stream.go:425] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=d4c1cc21d4c817ca] [remote-peer-id=1baed0b840546f7]
[2023/10/30 14:48:34.419 +08:00] [PANIC] [log.go:203] ["tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:48:34.419 +08:00] [PANIC] [log.go:203] ["tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:48:54.770 +08:00] [PANIC] [log.go:203] ["tocommit(4432459) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:13.070 +08:00] [PANIC] [log.go:203] ["tocommit(4432492) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:31.681 +08:00] [PANIC] [log.go:203] ["tocommit(4432521) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]
...
[2023/10/30 14:49:49.816 +08:00] [PANIC] [log.go:203] ["tocommit(4432522) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?"]


panic: tocommit(4432450) is out of range [lastIndex(4432444)]. Was the raft log corrupted, truncated, or lost?

goroutine 249 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00456e9c0, {0x0, 0x0, 0x0})
        /go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:232 +0x44c
go.uber.org/zap.(*SugaredLogger).log(0xc0002bad50, 0x4, {0x2b11764?, 0xdcbc85?}, {0xc007dac240?, 0x26ef300?, 0xc0098aa300?}, {0x0, 0x0, 0x0})
        /go/pkg/mod/go.uber.org/[email protected]/sugar.go:227 +0xee
go.uber.org/zap.(*SugaredLogger).Panicf(...)
        /go/pkg/mod/go.uber.org/[email protected]/sugar.go:159
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf(0x439fec?, {0x2b11764?, 0x2b?}, {0xc007dac240?, 0x1672334?, 0x3cd6081?})
        /go/pkg/mod/go.etcd.io/[email protected]/pkg/logutil/zap_raft.go:101 +0x49
go.etcd.io/etcd/raft.(*raftLog).commitTo(0xc000760150, 0x43a001)
        /go/pkg/mod/go.etcd.io/[email protected]/raft/log.go:203 +0x103
go.etcd.io/etcd/raft.(*raft).handleHeartbeat(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
        /go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1389 +0x45
go.etcd.io/etcd/raft.stepCandidate(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
        /go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1289 +0x585
go.etcd.io/etcd/raft.(*raft).Step(_, {0x8, 0x18cf5532a87c6527, 0xd4c1cc21d4c817ca, 0x18, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
        /go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:990 +0x1055
go.etcd.io/etcd/raft.(*node).run(0xc000118d80)
        /go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:354 +0xa05
created by go.etcd.io/etcd/raft.RestartNode
        /go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:242 +0x24a
@ahrtr
Copy link
Member

ahrtr commented Dec 7, 2023

#17078 (comment)

@zhan81776075
Copy link

zhan81776075 commented Feb 21, 2024

sry,when will backport the fix to 3.5
@ahrtr @overvenus

@ahrtr
Copy link
Member

ahrtr commented Feb 21, 2024

Please feel free to raise two PRs, one for 3.5 and the other for 3.4. thx

@henrybear327
Copy link
Contributor

Hey @ahrtr I can attempt the backport! :)

/assign

@henrybear327
Copy link
Contributor

@ahrtr I think we can close this issue?

@serathius
Copy link
Member

Yes, with fix being backported to all branches and changelog updated we can close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

6 participants