Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.4] panic: runtime error: invalid memory address or nil pointer dereference #14256

Open
ahrtr opened this issue Jul 22, 2022 · 18 comments
Open

Comments

@ahrtr
Copy link
Member

ahrtr commented Jul 22, 2022

I saw this error multiple times in 3.4 pipeline,

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xab2922]
goroutine 3500 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/home/runner/go/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000598f00, 0x15792dc, 0x3, 0x3, 0xc0000e34e8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/home/runner/work/etcd/etcd/mvcc/backend/read_tx.go:195 +0x642
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x67, 0x0, 0x66, ...)
	/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000598f30, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0xc00005e800, ...)
	/home/runner/work/etcd/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000598f60, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x40fa00, ...)
	/home/runner/work/etcd/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0005d95c0, 0xc0001747a8, 0x8, 0x8, 0x0, 0x0, 0x0, 0x0, 0x66, 0x0, ...)
	/home/runner/work/etcd/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc001bb20c0)
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:405 +0x1ef
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc001bb20c0)
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/home/runner/work/etcd/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
FAIL	go.etcd.io/etcd/etcdserver/api/v2store	3.484s
FAIL

Refer to https://github.com/etcd-io/etcd/runs/7460637358?check_suite_focus=true

Please see my comment: #14256 (comment)

@ahrtr
Copy link
Member Author

ahrtr commented Jul 26, 2022

@ramses, any update on this issue?

@ahrtr
Copy link
Member Author

ahrtr commented Jul 27, 2022

I suggest to diff the code between release-3.4 and (releasae-3.5 or main). cc @SimFG to take a look as well.

If there is no any progress until early next week, then I may take a deep dive myself.

@ramses
Copy link

ramses commented Jul 28, 2022

@ramses, any update on this issue?

@ahrtr, I was not investigating this issue; I was investigating the "split-brain" issue: #14143

Of these two, which one is currently more concerning to you?

@SimFG
Copy link
Contributor

SimFG commented Jul 28, 2022

@ahrtr Ok, I will look it.

@ahrtr
Copy link
Member Author

ahrtr commented Jul 28, 2022

I suggest to resolve this issue firstly. @SimFG @ramses

Comment on #14143,

  1. I am NOT convinced that it's real issue of etcd. Apart from the lazyfs issue, I even suspect it's issue on the client side, including the test case or jetcd. Of course, it needs more investigation;
  2. We can start with qualify our corruption detection mechanism (such as Fix corruption checks v3.5 #14282 ) and test process (such as Jepsen or our functional test) firstly

@SimFG
Copy link
Contributor

SimFG commented Jul 28, 2022

@ramses You can go on to see your current issue to avoid us doing repetitive work. cc @ahrtr, if there is any difficulty, I will keep you informed.

@ramses
Copy link

ramses commented Jul 28, 2022

  1. I am NOT convinced that it's real issue of etcd. [...] Of course, it needs more investigation;

I agree on both counts, @ahrtr. I'll keep digging down to be sure which one is it.

@ramses You can go on to see your current issue to avoid us doing repetitive work. cc @ahrtr, if there is any difficulty, I will keep you informed.

Noted, @SimFG . By the way, nice to meet you (in GitHub-verse) 🙂

@SimFG
Copy link
Contributor

SimFG commented Aug 1, 2022

@ahrtr I didn't find a solution to this problem. At first I thought it was because tx did not call the begin method, which caused the error that tx was nil, but it was not the case after a try. Then it is guessed that the server failed to start, resulting in the tx initialization not being run, but the verification did not get the result.

@ahrtr
Copy link
Member Author

ahrtr commented Aug 1, 2022

@ahrtr I didn't find a solution to this problem. At first I thought it was because tx did not call the begin method, which caused the error that tx was nil, but it was not the case after a try. Then it is guessed that the server failed to start, resulting in the tx initialization not being run, but the verification did not get the result.

Thanks @SimFG for the feedback. Let me take care of this issue.

@ahrtr
Copy link
Member Author

ahrtr commented Aug 2, 2022

This issue can happen on both main and release-3.4, and I believe release-3.5 has this issue as well.

The root cause is the etcd just stops immediately before (*readView) Range is called, so the tx is nil; accordingly the following range operation panics.

I managed to reproduce this issue intentionally by simulating the situation mentioned above. The code change (on 3.4) is below,

wachao-a01:etcd wachao$ git diff
diff --git a/etcdmain/etcd.go b/etcdmain/etcd.go
index 73328a73d..ca76e3910 100644
--- a/etcdmain/etcd.go
+++ b/etcdmain/etcd.go
@@ -294,6 +294,7 @@ func startEtcdOrProxyV2() {
        case <-stopped:
        }
 
+       time.Sleep(10 * time.Second)
        osutil.Exit(0)
 }
 
diff --git a/etcdserver/api/v3rpc/watch.go b/etcdserver/api/v3rpc/watch.go
index 6a9204024..e7c27713a 100644
--- a/etcdserver/api/v3rpc/watch.go
+++ b/etcdserver/api/v3rpc/watch.go
@@ -402,6 +402,11 @@ func (sws *serverWatchStream) sendLoop() {
                                events[i] = &evs[i]
                                if needPrevKV && !isCreateEvent(evs[i]) {
                                        opt := mvcc.RangeOptions{Rev: evs[i].Kv.ModRevision - 1}
+                                       var s *etcdserver.EtcdServer = sws.sg.(*etcdserver.EtcdServer)
+                                       go func() {
+                                               s.Stop()
+                                       }()
+                                       time.Sleep(1*time.Second)
                                        r, err := sws.watchable.Range(evs[i].Kv.Key, nil, opt)
                                        if err == nil && len(r.KVs) != 0 {
                                                events[i].PrevKv = &(r.KVs[0])
diff --git a/mvcc/backend/read_tx.go b/mvcc/backend/read_tx.go
index 91fe72ec5..f088e5fa9 100644
--- a/mvcc/backend/read_tx.go
+++ b/mvcc/backend/read_tx.go
@@ -16,6 +16,7 @@ package backend
 
 import (
        "bytes"
+       "fmt"
        "math"
        "sync"
 
@@ -192,6 +193,9 @@ func (rt *concurrentReadTx) UnsafeRange(bucketName, key, endKey []byte, limit in
        rt.txMu.RUnlock()
        if !ok {
                rt.txMu.Lock()
+               if rt.tx == nil {
+                       fmt.Println("###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!")
+               }
                bucket = rt.tx.Bucket(bucketName)
                rt.buckets[bn] = bucket
                rt.txMu.Unlock()

Then the issue can always be reproduced when running test TestV3WatchWithPrevKV,

wachao-a01:integration wachao$ go test -v -cpu 2 -run TestV3WatchWithPrevKV
=== RUN   TestV3WatchWithPrevKV
###### IN (*concurrentReadTx) UnsafeRange, tr.tx is nil!!!
    v3_watch_test.go:1212: timeout waiting for watch response
{"level":"warn","ts":"2022-08-02T10:38:23.889+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-e005a80b-1a7c-4a2c-83ed-a52cd49a61c4/localhost:83867718980294516900","attempt":0,"error":"rpc error: code = Unknown desc = raft: stopped"}
    v3_watch_test.go:1168: rpc error: code = Unknown desc = raft: stopped
    v3_watch_test.go:1198: rpc error: code = Canceled desc = context canceled
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x174694d]

goroutine 216 [running]:
go.etcd.io/bbolt.(*Tx).Bucket(...)
	/Users/wachao/go/gopath/pkg/mod/go.etcd.io/[email protected]/tx.go:101
go.etcd.io/etcd/mvcc/backend.(*concurrentReadTx).UnsafeRange(0xc000620300, 0x2316484, 0x3, 0x3, 0xc00011e5b8, 0x11, 0x12, 0x0, 0x0, 0x0, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/backend/read_tx.go:199 +0x64d
go.etcd.io/etcd/mvcc.(*storeTxnRead).rangeKeys(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x3, 0x0, 0x2, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:147 +0x293
go.etcd.io/etcd/mvcc.(*storeTxnRead).Range(0xc000620330, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000100000, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kvstore_txn.go:51 +0xaf
go.etcd.io/etcd/mvcc.(*metricsTxnWrite).Range(0xc000620360, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0x0, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/metrics_txn.go:37 +0xaf
go.etcd.io/etcd/mvcc.(*readView).Range(0xc0002101f0, 0xc00022ad10, 0x3, 0x8, 0x0, 0x0, 0x0, 0x0, 0x2, 0xc000351300, ...)
	/Users/wachao/go/src/github.com/ahrtr/etcd/mvcc/kv_view.go:39 +0x125
go.etcd.io/etcd/etcdserver/api/v3rpc.(*serverWatchStream).sendLoop(0xc000176b40)
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:410 +0x276
go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch.func1(0xc000176b40)
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:180 +0x2b
created by go.etcd.io/etcd/etcdserver/api/v3rpc.(*watchServer).Watch
	/Users/wachao/go/src/github.com/ahrtr/etcd/etcdserver/api/v3rpc/watch.go:179 +0x285
exit status 2
FAIL	go.etcd.io/etcd/integration	31.025s

The possibility of running into this issue is really low in production environment. It's even harder to reproduce this issue in the pipeline after merging #14290 . Since the etcdserver should have already been stopped when running into this issue, it means the last transaction should have already been committed, so it should be safe. So I don't think it's a blocker for 3.4.20 any more.

I will think about a formal fix or refactoring in future.

@JohnJAS
Copy link

JohnJAS commented Aug 29, 2022

i think i've met the same issue, start etcd v3.4.20 via static pod in a k8s cluster

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]

goroutine 265 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc0075d9300)
    /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc0075d9300, 0x2, 0x2)
    /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc0000b0600, 0xc000dabc20, 0x5f382e3fe91c3e3, 0x0, 0x0, 0x0)
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc0000b0600)
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc0000b0600, 0xc0007f5d60)
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
    /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1

@ahrtr
Copy link
Member Author

ahrtr commented Aug 29, 2022

One quick question, did you just see this issue once or can you easily reproduce this issue? @JohnJAS

@JohnJAS
Copy link

JohnJAS commented Aug 29, 2022

hi @ahrtr
It is hard for me to answer the question now because I only have checked one env and I found this issue inside the dead etcd container.
But it is true that we kept met with the ETCD issue after we've upgraded etcd to v3.4.20. Our daily pipeline was blocked for several days until we downgraded the etcd version.
I will try to manually upgrade etcd to see if I can reproduce this issue easily. And I will update my comments later.

@JohnJAS
Copy link

JohnJAS commented Aug 29, 2022

@ahrtr I've reproduced this issue on another cluster and just hit on the same issue.

I manually upgraded the etcd pods on each node one by one from v3.4.16 to v3.4.20. Then etcd pods crashed and exited like the following.

Error logs:

2022-08-29 06:29:02.702707 E | rafthttp: failed to read 88d5d8ebd854df5a on stream MsgApp v2 (unexpected EOF)
2022-08-29 06:29:02.702723 I | rafthttp: peer 88d5d8ebd854df5a became inactive (message send to peer failed)
2022-08-29 06:29:02.703504 W | rafthttp: lost the TCP streaming connection with peer 88d5d8ebd854df5a (stream Message reader)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x75745b]

goroutine 287 [running]:
go.uber.org/zap.(*Logger).check(0x0, 0x1, 0x10a3ea5, 0x36, 0xc003fb7480)
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:264 +0x9b
go.uber.org/zap.(*Logger).Warn(0x0, 0x10a3ea5, 0x36, 0xc003fb7480, 0x2, 0x2)
        /root/go/pkg/mod/go.uber.org/[email protected]/logger.go:194 +0x45
go.etcd.io/etcd/etcdserver.(*EtcdServer).requestCurrentIndex(0xc000398600, 0xc005210f00, 0xd15682e845891c12, 0x0, 0x0, 0x0)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:805 +0x873
go.etcd.io/etcd/etcdserver.(*EtcdServer).linearizableReadLoop(0xc000398600)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/v3_server.go:721 +0x2d6
go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach.func1(0xc000398600, 0xc00011ec20)
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2698 +0x57
created by go.etcd.io/etcd/etcdserver.(*EtcdServer).goAttach
        /tmp/etcd-release-3.4.20/etcd/release/etcd/etcdserver/server.go:2696 +0x1b1

Add something I have observed. When I try to use helm upgrade command which will call kube-apiserver API internally (kube-apiserver depends on etcd in the backend), this will always trigger the etcd crashed.

So I think this issue is not related the env and it is easy to reproduce.

However, ETCD v3.4.16 works well, but we still got some CVEs on this version and that's why we want to upgrade ETCD.

I will test etcd v3.4.19 which including the fix for those CVEs tomorrow.

Update later.

@rtheis
Copy link

rtheis commented Aug 29, 2022

We hit the same problem with etcd version 3.4.20 and it is very easy to reproduce in our environment.

@JohnJAS
Copy link

JohnJAS commented Aug 30, 2022

I can confirm that etcd v3.4.19 doesn't have this issue and we are happy to see that the two CVEs are fixed in this version. It looks like some fix in v3.4.20 had imported the current issue.

However, the error message in my case is not 100% as yours, though they are all about the nil pointor issue. I won't create another one in case if they have the same root cause.

Hope my research can help to narrow down the range of root cause.

@ahrtr
Copy link
Member Author

ahrtr commented Aug 30, 2022

Thanks @JohnJAS and @rtheis for the feedback. It seems that it's a different issue with this one, I just raised another issue #14402 to track it. Please feel free to deliver a PR for it if you are interested.

@AnassZn
Copy link

AnassZn commented Apr 29, 2024

runtime error: index out of range [0] with length 0
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xd4d846]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants