Thanos Receive (RouteOnly mode) Panic #6942

Open
jnyi opened this issue Nov 29, 2023 · 21 comments

@jnyi
Contributor

jnyi commented Nov 29, 2023

Thanos, Prometheus and Golang version used:
Thanos Version: 0.32.4/0.32.5
Golang Version: go1.21.3

Object Storage Provider: AWS S3

What happened: Thanos Receive in route-only mode panics frequently. The Receive setup:

      receive
      --debug.name=thanos-writer
      --log.format=logfmt
      --log.level=info
      --http-address=0.0.0.0:10902
      --http-grace-period=5m
      --grpc-address=0.0.0.0:10901
      --grpc-grace-period=5m
      --hash-func=SHA256
      --label
      replica="$(NAME)"
      --receive.default-tenant-id=unknown
      --remote-write.address=0.0.0.0:19291
      --receive-forward-timeout=15s
      --receive.hashrings-algorithm=ketama
      --receive.hashrings-file=/var/lib/tsdb/hashring.json
      --receive.hashrings-file-refresh-interval=3m
      --receive.replication-factor=3

What you expected to happen: No panic

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:
Panic from k8s docker logs

ts=2023-11-29T16:15:45.401596471Z caller=receive.go:535 level=info name=thanos-writer component=receive msg="Set up hashring for the given hashring config."
ts=2023-11-29T16:15:45.401626861Z caller=intrumentation.go:56 level=info name=thanos-writer component=receive msg="changing probe status" status=ready
runtime: g17389085: frame.sp=0xc0055cbe58 top=0xc0055cbfe0
	stack=[0xc00554c000-0xc0055cc000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x26b2ab8?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc000a0fd40 sp=0xc000a0fd10 pc=0x43b45c
runtime.(*unwinder).finishInternal(0x0?)
	/usr/local/go/src/runtime/traceback.go:571 +0x12a fp=0xc000a0fd80 sp=0xc000a0fd40 pc=0x461d4a
runtime.(*unwinder).next(0xc000a0fe28?)
	/usr/local/go/src/runtime/traceback.go:452 +0x232 fp=0xc000a0fdf8 sp=0xc000a0fd80 pc=0x461b52
runtime.addOneOpenDeferFrame.func1()
	/usr/local/go/src/runtime/panic.go:648 +0x85 fp=0xc000a0ffc8 sp=0xc000a0fdf8 pc=0x43a605
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
	/usr/local/go/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000a0ffd8 sp=0xc000a0ffc8 pc=0x46f70a

goroutine 17389085 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0055cbd68 sp=0xc0055cbd58 pc=0x46f6a8
runtime.addOneOpenDeferFrame(0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/panic.go:645 +0x65 fp=0xc0055cbda8 sp=0xc0055cbd68 pc=0x43a525
panic({0x2219b40?, 0x4238400?})
	/usr/local/go/src/runtime/panic.go:874 +0x14a fp=0xc0055cbe58 sp=0xc0055cbda8 pc=0x43adca
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:261
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:861 +0x378 fp=0xc0055cbeb8 sp=0xc0055cbe58 pc=0x452418
created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 17389042
	/go/pkg/mod/github.com/klauspost/[email protected]/s2/writer.go:505 +0xb5

.... more goroutine stack trace

goroutine 37381 [IO wait]:
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGSEGV: segmentation violation code=0x1 addr=0x118 pc=0x45fcfc]

runtime stack:
runtime.throw({0x27e0bfb?, 0x3fcaf40?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc0010c97d8 sp=0xc0010c97a8 pc=0x43907d
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:825 +0x3e9 fp=0xc0010c9838 sp=0xc0010c97d8 pc=0x4503a9
runtime.gentraceback(0x3f72aa0?, 0x3fcaf40?, 0xc0010c9bf0?, 0xc0007029c0, 0x0, 0x0, 0x64, 0x0, 0xc0010c9c10?, 0x0)
	/usr/local/go/src/runtime/traceback.go:258 +0x8bc fp=0xc0010c9b90 sp=0xc0010c9838 pc=0x45fcfc
runtime.traceback1(0xc0007029c0?, 0x43ab00?, 0x3?, 0xc0007029c0, 0x462afb?)
	/usr/local/go/src/runtime/traceback.go:776 +0x1b6 fp=0xc0010c9d50 sp=0xc0010c9b90 pc=0x461df6
runtime.traceback(...)
	/usr/local/go/src/runtime/traceback.go:723
runtime.tracebackothers.func1(0xc0007029c0)
	/usr/local/go/src/runtime/traceback.go:992 +0xe5 fp=0xc0010c9d90 sp=0xc0010c9d50 pc=0x462d25
runtime.forEachGRace(0xc0010c9df8)
	/usr/local/go/src/runtime/proc.go:604 +0x4d fp=0xc0010c9dc0 sp=0xc0010c9d90 pc=0x43c90d
runtime.tracebackothers(0xc00052d520?)
	/usr/local/go/src/runtime/traceback.go:978 +0xe5 fp=0xc0010c9e28 sp=0xc0010c9dc0 pc=0x462c05
runtime.dopanic_m(0xc00052d520, 0x2cdf6e8?, 0x1?)
	/usr/local/go/src/runtime/panic.go:1273 +0x285 fp=0xc0010c9ea0 sp=0xc0010c9e28 pc=0x439a65
runtime.fatalthrow.func1()
	/usr/local/go/src/runtime/panic.go:1127 +0x6e fp=0xc0010c9ee0 sp=0xc0010c9ea0 pc=0x43946e
runtime.fatalthrow(0x10c9f28?)
	/usr/local/go/src/runtime/panic.go:1120 +0x6c fp=0xc0010c9f20 sp=0xc0010c9ee0 pc=0x4393cc
runtime.throw({0x2798afc?, 0x100000004?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc0010c9f50 sp=0xc0010c9f20 pc=0x43907d
runtime.ready(0xc0054d3520, 0x466fe5?, 0x0?)
	/usr/local/go/src/runtime/proc.go:885 +0x1eb fp=0xc0010c9fa0 sp=0xc0010c9f50 pc=0x43d34b
runtime.goready.func1()
	/usr/local/go/src/runtime/proc.go:392 +0x26 fp=0xc0010c9fc8 sp=0xc0010c9fa0 pc=0x43bee6
runtime.systemstack()
	/usr/local/go/src/runtime/asm_amd64.s:496 +0x49 fp=0xc0010c9fd0 sp=0xc0010c9fc8 pc=0x46e0c9

Anything else we need to know:

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.18.4
PRETTY_NAME="Alpine Linux v3.18"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"
  • Kernel (e.g. uname -a): Linux thanos-writer-deployment-77677f8cb8-92h7x 5.4.0-1113-aws-fips #123+fips1-Ubuntu SMP Thu Oct 19 16:21:22 UTC 2023 x86_64 Linux
  • Others:


@yeya24
Contributor

yeya24 commented Nov 29, 2023

This seems to be the line that caused the panic: https://github.com/klauspost/compress/blob/v1.16.7/s2/writer.go#L505C1-L506C1

But I don't know how this could be related. Maybe it is a Go runtime bug. I see some issues mentioning it for Go 1.21.0 (golang/go#62182), but not on amd64, and your version is already Go 1.21.3.

@jnyi
Contributor Author

jnyi commented Nov 29, 2023

Interesting. We do use our own base OS here (Alpine Linux); has this been reported by other Thanos users? I also found another log related to this:


goroutine 7613 [IO wait, 3 minutes]:
runtime.gopark(0x1dcd?, 0x10006?, 0x0?, 0x0?, 0xc002b44000?)
	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc0037798c8 sp=0xc0037798a8 pc=0x43e2ae
panic during panic
SIGSEGV: segmentation violation
PC=0x461a25 m=7 sigcode=1

goroutine 0 [idle]:
runtime.(*unwinder).next(0xc000613960)
	/usr/local/go/src/runtime/traceback.go:463 +0x105 fp=0xc0006133f0 sp=0xc000613378 pc=0x461a25
runtime.traceback2(0xc000613960, 0x0, 0x0, 0x31)
	/usr/local/go/src/runtime/traceback.go:987 +0x125 fp=0xc000613658 sp=0xc0006133f0 pc=0x4632c5
runtime.traceback1.func1(0x20?)
	/usr/local/go/src/runtime/traceback.go:923 +0x65 fp=0xc000613828 sp=0xc000613658 pc=0x463085
runtime.traceback1(0xc000502d00?, 0x43cf00?, 0x3?, 0xc000502d00, 0xfb?)
	/usr/local/go/src/runtime/traceback.go:946 +0x212 fp=0xc000613b38 sp=0xc000613828 pc=0x462ef2
runtime.traceback(...)
	/usr/local/go/src/runtime/traceback.go:823
runtime.tracebackothers.func1(0xc000502d00)
	/usr/local/go/src/runtime/traceback.go:1254 +0xe5 fp=0xc000613b78 sp=0xc000613b38 pc=0x464805
runtime.forEachGRace(0xc000613be0)
	/usr/local/go/src/runtime/proc.go:621 +0x49 fp=0xc000613ba8 sp=0xc000613b78 pc=0x43ecc9
runtime.tracebackothers(0xc0006021a0?)
	/usr/local/go/src/runtime/traceback.go:1240 +0xdb fp=0xc000613c10 sp=0xc000613ba8 pc=0x4646fb
runtime.dopanic_m(0xc0006021a0, 0x2c4e378?, 0x1?)
	/usr/local/go/src/runtime/panic.go:1316 +0x2a6 fp=0xc000613c90 sp=0xc000613c10 pc=0x43bde6
runtime.fatalthrow.func1()
	/usr/local/go/src/runtime/panic.go:1170 +0x6b fp=0xc000613cd0 sp=0xc000613c90 pc=0x43b80b
runtime.fatalthrow(0x613d18?)
	/usr/local/go/src/runtime/panic.go:1163 +0x65 fp=0xc000613d10 sp=0xc000613cd0 pc=0x43b765
runtime.throw({0x26b2ab8?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc000613d40 sp=0xc000613d10 pc=0x43b45c
runtime.(*unwinder).finishInternal(0x0?)
	/usr/local/go/src/runtime/traceback.go:571 +0x12a fp=0xc000613d80 sp=0xc000613d40 pc=0x461d4a
runtime.(*unwinder).next(0xc000613e28?)
	/usr/local/go/src/runtime/traceback.go:452 +0x232 fp=0xc000613df8 sp=0xc000613d80 pc=0x461b52
runtime.addOneOpenDeferFrame.func1()
	/usr/local/go/src/runtime/panic.go:648 +0x85 fp=0xc000613fc8 sp=0xc000613df8 pc=0x43a605
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
	/usr/local/go/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000613fd8 sp=0xc000613fc8 pc=0x46f70a

goroutine 22119648 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0046d7d68 sp=0xc0046d7d58 pc=0x46f6a8
runtime.addOneOpenDeferFrame(0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/panic.go:645 +0x65 fp=0xc0046d7da8 sp=0xc0046d7d68 pc=0x43a525
panic({0x2219b40?, 0x4238400?})
	/usr/local/go/src/runtime/panic.go:874 +0x14a fp=0xc0046d7e58 sp=0xc0046d7da8 pc=0x43adca
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:261
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:861 +0x378 fp=0xc0046d7eb8 sp=0xc0046d7e58 pc=0x452418
created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 22119666
	/go/pkg/mod/github.com/klauspost/[email protected]/s2/writer.go:505 +0xb5

@jnyi
Contributor Author

jnyi commented Dec 1, 2023

Reporting a different panic stack trace, possibly related to the HTTP/2 client?

fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x437ffd]

runtime stack:
runtime.throw({0x27e0bfb?, 0x395b0f0?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc00010ded8 sp=0xc00010dea8 pc=0x43907d
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:825 +0x3e9 fp=0xc00010df38 sp=0xc00010ded8 pc=0x4503a9
runtime.printpanics(0x36343234342d7270)
	/usr/local/go/src/runtime/panic.go:592 +0x1d fp=0xc00010df58 sp=0xc00010df38 pc=0x437ffd
runtime.printpanics(0xc002abea38)
	/usr/local/go/src/runtime/panic.go:593 +0x2e fp=0xc00010df78 sp=0xc00010df58 pc=0x43800e
runtime.fatalpanic.func1()
	/usr/local/go/src/runtime/panic.go:1162 +0x6e fp=0xc00010dfc8 sp=0xc00010df78 pc=0x4395ee
runtime.systemstack()
	/usr/local/go/src/runtime/asm_amd64.s:496 +0x49 fp=0xc00010dfd0 sp=0xc00010dfc8 pc=0x46e0c9

goroutine 97342453 [running]:
runtime: g 97342453: unexpected return pc for runtime.systemstack_switch called from 0x5f5f656d616e5f5f
stack: frame={sp:0xc002abe968, fp:0xc002abe970} stack=[0xc002abe000,0xc002abf000)
0x000000c002abe868:  0x69767265732d6c61  0x2d6562756b2e6563
0x000000c002abe878:  0x656d2d6574617473  0x76732e7363697274
0x000000c002abe888:  0x190a303830383a63  0x6b1212626f6a030a
0x000000c002abe898:  0x746174732d656275  0x63697274656d2d65
0x000000c002abe8a8:  0x6d616e090a210a73  0x1412656361707365
0x000000c002abe8b8:  0x6168732d74736574  0x34342d72702d6472
0x000000c002abe8c8:  0x030a230a31353030  0x6465721c12646f70
0x000000c002abe8d8:  0x73626f6a2d687361  0x633638623436352d
0x000000c002abe8e8:  0x7a3973322d663663  0x676572060a130a6a
0x000000c002abe8f8:  0x2d737509126e6f69  0x1a0a312d74736165
0x000000c002abe908:  0x4e6472616873090a  0x69766e0d12656d61
0x000000c002abe918:  0x642d61696e696772  0x6975030a2b0a7665
0x000000c002abe928:  0x6334393665241264  0x636534332d343331
0x000000c002abe938:  0x34622d613962342d  0x34653262312d3462
0x000000c002abe948:  0x1230336533303535  0x58a54d4000000910
0x000000c002abe958:  0xc286dc91d81041d9  0x080a2c0a038e0a31
0x000000c002abe968: <0x5f5f656d616e5f5f >0x705f6562756b2012
0x000000c002abe978:  0x61746e6f635f646f  0x6174735f72656e69
0x000000c002abe988:  0x74726174735f6574  0x6c63050a0c0a6465
0x000000c002abe998:  0x737761031264756f  0x746e6f63090a1d0a
0x000000c002abe9a8:  0x69101272656e6961  0x2d65636e6174736e
0x000000c002abe9b8:  0x0a726567616e616d  0x0312766e65030a0a
0x000000c002abe9c8:  0x69080a4e0a766564  0x1265636e6174736e
0x000000c002abe9d8:  0x74732d6562756b42  0x7274656d2d657461
0x000000c002abe9e8:  0x692d33322d736369  0x2d6c616e7265746e
0x000000c002abe9f8:  0x2e65636976726573  0x6174732d6562756b
0x000000c002abea08:  0x697274656d2d6574  0x383a6376732e7363
0x000000c002abea18:  0x6a030a190a303830  0x6562756b1212626f
0x000000c002abea28:  0x6d2d65746174732d  0x210a736369727465
0x000000c002abea38:  0x7073656d616e090a  0x7365741412656361
0x000000c002abea48:  0x2d64726168732d74  0x36343234342d7270
0x000000c002abea58:  0x646f70030a270a34  0x6e6174736e692012
0x000000c002abea68:  0x67616e616d2d6563
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:463 fp=0xc002abe970 sp=0xc002abe968 pc=0x46e060
created by google.golang.org/grpc/internal/transport.newHTTP2Client
	/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_client.go:364 +0x1a1f

goroutine 1 [chan receive, 718 minutes]:
runtime.gopark(0xc000dbfb48?, 0x41b671?, 0xe0?, 0xef?, 0xc000dbfbb0?)
	/usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc00204db30 sp=0xc00204db10 pc=0x43bdd6
runtime.chanrecv(0xc000b90540, 0xc000dbfc48, 0x1)
	/usr/local/go/src/runtime/chan.go:583 +0x49d fp=0xc00204dbc0 sp=0xc00204db30 pc=0x406f9d
runtime.chanrecv1(0xc00072a800?, 0x9?)
	/usr/local/go/src/runtime/chan.go:442 +0x18 fp=0xc00204dbe8 sp=0xc00204dbc0 pc=0x406a98
github.com/oklog/run.(*Group).Run(0xc0004d7230)
	/go/pkg/mod/github.com/oklog/[email protected]/group.go:43 +0x16b fp=0xc00204dc68 sp=0xc00204dbe8 pc=0x62cacb
main.main()
	/app/cmd/thanos/main.go:159 +0x1725 fp=0xc00204df80 sp=0xc00204dc68 pc=0x2016905
runtime.main()
	/usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc00204dfe0 sp=0xc00204df80 pc=0x43b9a7
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00204dfe8 sp=0xc00204dfe0 pc=0x470141

@dctrwatson

Lots of: fatal error: traceback did not unwind completely

But one time did get more:

fatal error: slice bounds out of range
fatal error: index out of range
panic during panic
SIGSEGV: segmentation violation
PC=0x472d5c m=9 sigcode=128

goroutine 0 [idle]:
runtime.usleep()
        /usr/local/go/src/runtime/sys_linux_amd64.s:135 +0x3c fp=0xc00118ddb8 sp=0xc00118ddb0 pc=0x472d5c
runtime: g 0: unexpected return pc for runtime.usleep called from 0xc3f06ce4ebcc77b6
stack: frame={sp:0xc00118ddb0, fp:0xc00118ddb8} stack=[0xc00118a000,0xc00118e000)
0x000000c00118dcb0:  0x3c3340f3c227fbf8  0x79bc80d2fdde9458
0x000000c00118dcc0:  0x2dd704326b3009fa  0x17345fc643dff433
0x000000c00118dcd0:  0x3241b266f903f1ae  0x8fc501ce5567bc22
0x000000c00118dce0:  0xa395dd9af6d97aba  0x953d2adbb67eb6c0
0x000000c00118dcf0:  0xc01a1ee400460928  0xcbc78c168ec5bc26
0x000000c00118dd00:  0xf698e3343fa16822  0x426667d3ab61d8bf
0x000000c00118dd10:  0x177a662b55361989  0x47acd314c41d51f7
0x000000c00118dd20:  0xbd52fd18cca6e6a5  0x39f03de62494ad2e
0x000000c00118dd30:  0xc5ea86a4f373a6e4  0x5b1788faeeae30eb
0x000000c00118dd40:  0xe0e5fcf7e312da2b  0xe726c9b9a759e5a5
0x000000c00118dd50:  0x283e9e7ea0398998  0x2694029422d2c0ca
0x000000c00118dd60:  0x80fcd1c94745a41a  0x4c914de599147d77
0x000000c00118dd70:  0xc04b2a9646c165f3  0x66d8d1497674d855
0x000000c00118dd80:  0x00899d5fcbd0b13b  0x1ec6839c21b87faf
0x000000c00118dd90:  0xc188389384810d93  0xdbefaf5fa465a331
0x000000c00118dda0:  0x3599b61340b9afc0  0xcd240136b2b9fd3a
0x000000c00118ddb0: <0xc3f06ce4ebcc77b6 >0x140433104b2f520c
0x000000c00118ddc0:  0xa1606b824380066c  0xc782178cc7735f3b
0x000000c00118ddd0:  0x4f94b88647ff179c  0x36c57536a7b57377
0x000000c00118dde0:  0xed0ccb6daa1e2206  0xb314e84cec4d3134
0x000000c00118ddf0:  0xc5e83fab342ee596  0x5c01897f6b6615cc
0x000000c00118de00:  0x1f5b6e10e1363122  0x4aee5af6695344b2
0x000000c00118de10:  0xcaa70494b921fa78  0xd189d105408cbd11
0x000000c00118de20:  0x181c6f07b4ef7a16  0x1d0d4765832dcd0c
0x000000c00118de30:  0x87060ecd6c84e951  0x274064a173099845
0x000000c00118de40:  0x1e252a1fdca33895  0xd6ccf35a2c68179f
0x000000c00118de50:  0x836e3bab7ed44eb4  0xc056c1a729014758
0x000000c00118de60:  0xc5501cb1b8184f55  0x2919f8a06039fd69
0x000000c00118de70:  0x60660cffa8e2fc2b  0x6dada1c343524be3
0x000000c00118de80:  0xb0d29d92026fd6fd  0xe7524cab3e45f791
0x000000c00118de90:  0xc7dc33955b898e65  0x1aa5aa09fba46f62
0x000000c00118dea0:  0xa2780d11f200ad96  0xf5a944f1d0d62522
0x000000c00118deb0:  0x835d046bf0f64589

goroutine 705932 [running]:
runtime.systemstack_switch

@jnyi
Contributor Author

jnyi commented Dec 17, 2023

OK, it looks like klauspost/compress#867 is the root cause and it is fixed in #6950. I saw that Thanos main picked up the newer go mod, but neither 0.32.5 nor v0.33.0-rc.0 did.

I will cherry pick the updated go mod in order to fix this internally.

cc @mhoffm-aiven @yeya24 to make sure this gets patched to latest v0.33, thanks

@fpetkovski
Contributor

Thanks for reporting back the resolution @jnyi

@jnyi
Contributor Author

jnyi commented Dec 18, 2023

Actually, it might be a false resolution. The panic still seems to happen after I upgraded [email protected], but very infrequently. I am now trying [email protected]; I will let it run a bit longer overnight and report back. Sorry for the inconclusive post earlier.

@jnyi
Contributor Author

jnyi commented Dec 18, 2023

OK, it panicked again with /go/pkg/mod/github.com/klauspost/[email protected]/s2/writer.go:505 +0xb5

stack trace:

unexpected fault address 0x0
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x43e1f3]

goroutine 6281927 [running]:
runtime.throw({0x2654fb1?, 0x6d614e6472616873?})
	/usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc00468de40 sp=0xc00468de10 pc=0x43b39c
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:875 +0x285 fp=0xc00468dea0 sp=0xc00468de40 pc=0x452265
runtime.gopark(0x6e6f636573696c6c?, 0x656b6375625f7364?, 0x74?, 0xa?, 0x61642d676f6c1412?)
	/usr/local/go/src/runtime/proc.go:399 +0xd3 fp=0xc00468dea8 sp=0xc00468dea0 pc=0x43e1f3
runtime: g 6281927: unexpected return pc for runtime.gopark called from 0x696d5f79636e6574
stack: frame={sp:0xc00468dea0, fp:0xc00468dea8} stack=[0xc00460e000,0xc00468e000)
0x000000c00468dda0:  0x0000000000000001  0x0000000000000001
0x000000c00468ddb0:  0x000000c00468de2d  0x000000000046f632 <runtime.systemstack+0x0000000000000032>
0x000000c00468ddc0:  0x000000c00468de00  0x000000000043b6a5 <runtime.fatalthrow+0x0000000000000065>
0x000000c00468ddd0:  0x000000c00468dde0  0x000000c000f7e680
0x000000c00468dde0:  0x000000000043b6e0 <runtime.fatalthrow.func1+0x0000000000000000>  0x000000c000f7e680
0x000000c00468ddf0:  0x000000000043b39c <runtime.throw+0x000000000000005c>  0x000000c00468de10
0x000000c00468de00:  0x000000c00468de30  0x000000000043b39c <runtime.throw+0x000000000000005c>
0x000000c00468de10:  0x000000c00468de18  0x000000000043b3c0 <runtime.throw.func1+0x0000000000000000>
0x000000c00468de20:  0x0000000002654fb1  0x0000000000000005
0x000000c00468de30:  0x000000c00468de90  0x0000000000452265 <runtime.sigpanic+0x0000000000000285>
0x000000c00468de40:  0x0000000002654fb1  0x6d614e6472616873
0x000000c00468de50:  0x0000000000000000  0x7665642d61696e69
0x000000c00468de60:  0x0000000000091012  0xdc8b969a10401000
0x000000c00468de70:  0x000000c000f7e680  0x5f656d616e5f5f08
0x000000c00468de80:  0x6164676f6c36125f  0x636f645f6e6f6d65
0x000000c00468de90:  0x614c726564616572  0x000000000043e1f3 <runtime.gopark+0x00000000000000d3>
0x000000c00468dea0: <0x696d5f79636e6574 >0x6e6f636573696c6c
0x000000c00468deb0:  0x656b6375625f7364  0x707061030a1b0a74
0x000000c00468dec0:  0x61642d676f6c1412  0x6561642d6e6f6d65
0x000000c00468ded0:  0x0c0a7465736e6f6d  0x1264756f6c63050a
0x000000c00468dee0:  0x0e0a150a73776103  0x72705f64756f6c63
0x000000c00468def0:  0x031272656469766f  0x63150a260a535741
0x000000c00468df00:  0x6f72705f64756f6c  0x65725f7265646976
0x000000c00468df10:  0x57410d126e6f6967  0x5341455f53555f53
0x000000c00468df20:  0x640e0a150a315f54  0x6f7269766e655f62
0x000000c00468df30:  0x440312746e656d6e  0x6264140a1e0a5645
0x000000c00468df40:  0x74616c756765725f  0x616d6f645f79726f
0x000000c00468df50:  0x4c42555006126e69  0x6e65030a0a0a4349
0x000000c00468df60:  0x1d0a766564031276  0x6e6174736e69080a
0x000000c00468df70:  0x322e303111126563  0x3836312e3630312e
0x000000c00468df80:  0x0a160a373737373a  0x756b0f12626f6a03
0x000000c00468df90:  0x736574656e726562  0x0a2a0a73646f702d
0x000000c00468dfa0:  0x656e726562756b17
created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 6281866
	/go/pkg/mod/github.com/klauspost/[email protected]/s2/writer.go:505 +0xb5

goroutine 1 [chan receive, 45 minutes]:
runtime.gopark(0xc000a3fb68?, 0x4105c5?, 0xc0?, 0x9c?, 0x20?)
	/usr/local/go/src/runtime/proc.go:398 +0xce fp=0xc0012efb00 sp=0xc0012efae0 pc=0x43e1ee
runtime.chanrecv(0xc0007d06c0, 0xc000a3fc00, 0x1)
	/usr/local/go/src/runtime/chan.go:583 +0x3cd fp=0xc0012efb78 sp=0xc0012efb00 pc=0x4099ad
runtime.chanrecv1(0xc000a3fc10?, 0x9?)
	/usr/local/go/src/runtime/chan.go:442 +0x12 fp=0xc0012efba0 sp=0xc0012efb78 pc=0x4095b2
github.com/oklog/run.(*Group).Run(0xc000011968)
	/go/pkg/mod/github.com/oklog/[email protected]/group.go:43 +0x155 fp=0xc0012efc20 sp=0xc0012efba0 pc=0x622255
main.main()
	/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:159 +0x1878 fp=0xc0012eff40 sp=0xc0012efc20 pc=0x1f0e7b8
runtime.main()
	/usr/local/go/src/runtime/proc.go:267 +0x2bb fp=0xc0012effe0 sp=0xc0012eff40 pc=0x43dd7b
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1650 +0x1 fp=0xc0012effe8 sp=0xc0012effe0 pc=0x471441

@MichaHoffmann
Contributor

I already prepared and built artifacts for 0.33 yesterday; since this is still ongoing, I'll earmark it for 0.33.1!

@klauspost

Could you try building with -tags=noasm?

@klauspost

klauspost commented Dec 18, 2023

We have been unable to pinpoint the origin of similar crashes at MinIO. It seems to happen on only select machines and the only reliable workaround we've been using is to compile with go 1.19.x which fixes the issue. I've created an issue (link above this post) to see if we can get to the bottom of this!

@klauspost

klauspost commented Dec 18, 2023

@dctrwatson @jnyi - The Go team is asking for Linux kernel versions. I don't know if you have that, but if you do, please add it to golang/go#64781

@jnyi
Contributor Author

jnyi commented Dec 19, 2023

Confirmed: after compiling Thanos with Go 1.19 (i.e. < 1.20), there have been no panics in ~10 hours. Previously this would happen 20+ times in 10 hours for a deployment with 4 instances.

@dctrwatson

Per the request in golang/go#64781 I added GODEBUG="gccheckmark=1,gcshrinkstackoff=1,asyncpreemptoff=1" and we have not had a panic in >24h. We used to see at least a couple per hour.

@dctrwatson

Ran each GODEBUG flag separately; it seems only GODEBUG=gcshrinkstackoff=1 is needed to prevent the panics for now.
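
For anyone applying this workaround, here is a minimal Go sketch of checking at startup whether gcshrinkstackoff=1 is actually present in the process environment. The helper name godebugSetting is hypothetical (it is not part of Thanos or the Go runtime); it only parses the comma-separated key=value format that GODEBUG uses, and it assumes the last occurrence of a key wins.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// godebugSetting returns the value of key in a GODEBUG-style string
// ("k1=v1,k2=v2") and whether the key was present.
// Hypothetical helper for illustration; in this sketch the last occurrence wins.
func godebugSetting(godebug, key string) (string, bool) {
	value, found := "", false
	for _, kv := range strings.Split(godebug, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(kv), "=")
		if ok && k == key {
			value, found = v, true
		}
	}
	return value, found
}

func main() {
	v, ok := godebugSetting(os.Getenv("GODEBUG"), "gcshrinkstackoff")
	if ok && v == "1" {
		fmt.Println("stack shrinking disabled (gcshrinkstackoff=1)")
		return
	}
	fmt.Println("gcshrinkstackoff=1 not set; the workaround is not active")
}
```

Running it as `GODEBUG=gcshrinkstackoff=1 go run .` should take the first branch. A check like this only confirms the variable reached the process (for example through a Kubernetes env entry); it says nothing about whether the underlying crash is fixed.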

@karsov

karsov commented Feb 5, 2024

At my company we were also having a very similar panic for the Thanos Receiver (RouteOnly mode) with version v0.33.0.
Interestingly it was happening only in our most active region.
Also, in our case the stack traces were not printed and the last log we had was fatal: bad g in signal handler or fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?).

Setting the environment variable GODEBUG=gcshrinkstackoff=1 (as suggested above) seems to have completely resolved the issue. Previously we had a restart every couple of hours, now we have none in 4 days.

@RodrigoMenezes-Vantage

RodrigoMenezes-Vantage commented Jun 21, 2024

We've been having this happen on bitnami/thanos:0.35.1-debian-12-r1 as well. It happens consistently across all 3 replicas of thanos-receive-distributor we run.

Kernel Version: 6.1.91-99.172.amzn2023.x86_64
OS Image: Amazon Linux 2023.4.20240528
Operating System: linux
Architecture: amd64

runtime: g42867214: frame.sp=0xc0020c1e50 top=0xc0020c1fe0
	stack=[0xc002042000-0xc0020c2000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x28d4031?, 0x0?})
	/opt/bitnami/go/src/runtime/panic.go:1077 +0x5c fp=0xc000295d40 sp=0xc000295d10 pc=0x43b7dc
runtime.(*unwinder).finishInternal(0x0?)
	/opt/bitnami/go/src/runtime/traceback.go:568 +0x12a fp=0xc000295d80 sp=0xc000295d40 pc=0x46210a
runtime.(*unwinder).next(0xc000295e28?)
	/opt/bitnami/go/src/runtime/traceback.go:449 +0x235 fp=0xc000295df8 sp=0xc000295d80 pc=0x461f15
runtime.addOneOpenDeferFrame.func1()
	/opt/bitnami/go/src/runtime/panic.go:648 +0x85 fp=0xc000295fc8 sp=0xc000295df8 pc=0x43a985
runtime.systemstack()
	/opt/bitnami/go/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000295fd8 sp=0xc000295fc8 pc=0x46fb8a

goroutine 42867214 [running]:
runtime.systemstack_switch()
	/opt/bitnami/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0020c1d60 sp=0xc0020c1d50 pc=0x46fb28
runtime.addOneOpenDeferFrame(0x0?, 0x0?, 0x0?)
	/opt/bitnami/go/src/runtime/panic.go:645 +0x65 fp=0xc0020c1da0 sp=0xc0020c1d60 pc=0x43a8a5
panic({0x23fd040?, 0x4cb1710?})
	/opt/bitnami/go/src/runtime/panic.go:874 +0x14a fp=0xc0020c1e50 sp=0xc0020c1da0 pc=0x43b14a
runtime.panicmem(...)
	/opt/bitnami/go/src/runtime/panic.go:261
runtime.sigpanic()
	/opt/bitnami/go/src/runtime/signal_unix.go:861 +0x378 fp=0xc0020c1eb0 sp=0xc0020c1e50 pc=0x4527f8
created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 11220
	/bitnami/blacksmith-sandox/thanos-0.35.1/pkg/mod/github.com/klauspost/[email protected]/s2/writer.go:509 +0xb5

@yangtian9999

yangtian9999 commented Jul 9, 2024

Hi, @RodrigoMenezes-Vantage
I am hitting a similar issue.

I already patched to add env GODEBUG=gcshrinkstackoff=1. Still monitoring.

Is this related to golang/go#64934 ?

created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 232
	/bitnami/blacksmith-sandox/thanos-0.35.0/pkg/mod/github.com/klauspost/[email protected]/s2/writer.go:509 +0xb5

goroutine 1 [chan receive, 55 minutes]:
runtime.gopark(0xc?, 0xc002521348?, 0x8?, 0x0?, 0x0?)
	/opt/bitnami/go/src/runtime/proc.go:398 +0xce fp=0xc0019b7ad8 sp=0xc0019b7ab8 pc=0x43e6ce
panic during panic
SIGSEGV: segmentation violation
PC=0x43a757 m=7 sigcode=1

goroutine 0 [idle]:
runtime.printpanics(0x3)
	/opt/bitnami/go/src/runtime/panic.go:593 +0x17 fp=0xc000613f58 sp=0xc000613f38 pc=0x43a757
runtime.printpanics(0xc0030dfe60)
	/opt/bitnami/go/src/runtime/panic.go:594 +0x2d fp=0xc000613f78 sp=0xc000613f58 pc=0x43a76d
runtime.fatalpanic.func1()
	/opt/bitnami/go/src/runtime/panic.go:1205 +0x69 fp=0xc000613fc8 sp=0xc000613f78 pc=0x43bd09
runtime.systemstack()
	/opt/bitnami/go/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000613fd8 sp=0xc000613fc8 pc=0x46fb8a

goroutine 2704737 [running]:
runtime.systemstack_switch()
	/opt/bitnami/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0030dfda8 sp=0xc0030dfd98 pc=0x46fb28
runtime: g 2704737: unexpected return pc for runtime.systemstack_switch called from 0xc0034965d2
stack: frame={sp:0xc0030dfd98, fp:0xc0030dfda8} stack=[0xc003060000,0xc0030e0000)

@yangtian9999

Hi guys,
Since the GODEBUG env setting is only a workaround,
is there any update on this thanos-distributor restart issue?

@mvv-dvb

mvv-dvb commented Oct 4, 2024

It looks like this is at least fixed in Go 1.22, which might mean the next release of Thanos should be able to run without this env variable set, I think.

@klauspost

klauspost commented Oct 4, 2024

I have seen strange stuff even on Go 1.22 - so I have (just today) merged a change that ditches the large stack: klauspost/compress#1014

It may not be needed, but I couldn't live with the potential problem going forward even if it was a runtime issue. And since I don't have a clean reproducer, I thought this might be the best approach.

I will probably make a release before too long, so hopefully we can close this down for good. The downside is mainly more clunky code - the performance seems to remain the same AFAICT.
