Recover from panic in http and grpc handlers. #2059

cyriltovena · 2020-05-08T19:25:13Z

I don't see any good reason to crash any component during a bad request.

The stacktrace is still printed in multiline to stderr.
e.g

panic: foo
goroutine 50 [running]:
github.com/grafana/loki/pkg/util/server.onPanic(0x16c7720, 0x18af5e0, 0x4fd01d5aaf8, 0x3e7)
	/Users/ctovena/go/src/github.com/grafana/loki/pkg/util/server/recovery.go:41 +0x85
github.com/grpc-ecosystem/go-grpc-middleware/recovery.WithRecoveryHandler.func1.1(0x18d2140, 0xc000126008, 0x16c7720, 0x18af5e0, 0x1d81ca0, 0x1663693)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/recovery/options.go:33 +0x39
github.com/grpc-ecosystem/go-grpc-middleware/recovery.recoverFrom(0x18d2140, 0xc000126008, 0x16c7720, 0x18af5e0, 0xc0002166c0, 0x1d5ed32, 0xd0)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/recovery/interceptors.go:52 +0x5a
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1.1(0x18d2140, 0xc000126008, 0xc000204088, 0xc00026af28)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/recovery/interceptors.go:26 +0x77
panic(0x16c7720, 0x18af5e0)
	/usr/local/Cellar/go/1.14.2_1/libexec/src/runtime/panic.go:969 +0x166
github.com/grafana/loki/pkg/util/server.Test_onPanic.func3(0x18d2140, 0xc000126008, 0x0, 0x0, 0xc00026aed8, 0x16642ab, 0x2575318, 0xc000263200)
	/Users/ctovena/go/src/github.com/grafana/loki/pkg/util/server/recovery_test.go:32 +0x39
github.com/grpc-ecosystem/go-grpc-middleware/recovery.UnaryServerInterceptor.func1(0x18d2140, 0xc000126008, 0x0, 0x0, 0x0, 0x18075d8, 0x0, 0x0, 0x0, 0x0)
	/Users/ctovena/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/recovery/interceptors.go:30 +0xbc
github.com/grafana/loki/pkg/util/server.Test_onPanic(0xc000263200)
	/Users/ctovena/go/src/github.com/grafana/loki/pkg/util/server/recovery_test.go:31 +0x2c9
testing.tRunner(0xc000263200, 0x18075e0)
	/usr/local/Cellar/go/1.14.2_1/libexec/src/testing/testing.go:991 +0xdc
created by testing.(*T).Run
	/usr/local/Cellar/go/1.14.2_1/libexec/src/testing/testing.go:1042 +0x357

Signed-off-by: Cyril Tovena [email protected]

I don't see any good reason to crash any component during a bad request. Signed-off-by: Cyril Tovena <[email protected]>

Signed-off-by: Cyril Tovena <[email protected]>

codecov-io · 2020-05-08T19:52:20Z

Codecov Report

Merging #2059 into master will increase coverage by 0.17%.
The diff coverage is 42.30%.

@@            Coverage Diff             @@
##           master    #2059      +/-   ##
==========================================
+ Coverage   64.02%   64.19%   +0.17%     
==========================================
  Files         133      136       +3     
  Lines       10222    10240      +18     
==========================================
+ Hits         6545     6574      +29     
+ Misses       3187     3175      -12     
- Partials      490      491       +1

Impacted Files	Coverage Δ
pkg/loki/fake_auth.go	`0.00% <0.00%> (ø)`
pkg/loki/loki.go	`0.00% <0.00%> (ø)`
pkg/loki/modules.go	`15.31% <0.00%> (-0.48%)`	⬇️
pkg/querier/http.go	`0.00% <0.00%> (-10.15%)`	⬇️
pkg/util/server/error.go	`100.00% <100.00%> (ø)`
pkg/util/server/middleware.go	`100.00% <100.00%> (ø)`
pkg/util/server/recovery.go	`100.00% <100.00%> (ø)`
... and 2 more

owen-d

This looks good but I'm not sure about putting panic recovery on the write path. I believe we wanted to recover on the read path and specifically on ingester reads which is the one area our reads/write coincide and thus a place where read panics could cause data loss.

owen-d · 2020-05-11T13:15:32Z

pkg/loki/loki.go

@@ -140,37 +141,33 @@ func New(cfg Config) (*Loki, error) {
 }

 func (t *Loki) setupAuthMiddleware() {
+	t.cfg.Server.GRPCMiddleware = []grpc.UnaryServerInterceptor{serverutil.RecoveryGRPCUnaryInterceptor}


I'm not sure if we want recovery middleware everywhere -- what about just on the read path?

cyriltovena · 2020-05-11T13:27:52Z

This looks good but I'm not sure about putting panic recovery on the write path. I believe we wanted to recover on the read path and specifically on ingester reads which is the one area our reads/write coincide and thus a place where read panics could cause data loss.

This was intentional, I don't see why an API call read or write should cause a crash and data loss. Even crashing a distributor can cause an outage, imagine a situation where we have a specific stream that takes down a distributor. In a very short time, it could bring the whole write path down, for all tenant.

Again I don't see a situation where we would want to leave a component crash on request.

slim-bean · 2020-05-11T14:26:02Z

Again I don't see a situation where we would want to leave a component crash on request.

I would second this, read/write we never want a panic. Write is actually a much worse scenario because it would guarantee round robin crash of every component because writes are consistent and always happening.

owen-d

lgtm

Recover from panic in http and grpc handlers.

72cff07

I don't see any good reason to crash any component during a bad request. Signed-off-by: Cyril Tovena <[email protected]>

pull-request-size bot added the size/L label May 8, 2020

cyriltovena added 2 commits May 8, 2020 15:32

Add alerts to the mixin for panics.

152328e

Signed-off-by: Cyril Tovena <[email protected]>

😡 gomod

8b665c1

Signed-off-by: Cyril Tovena <[email protected]>

owen-d reviewed May 11, 2020

View reviewed changes

owen-d approved these changes May 11, 2020

View reviewed changes

cyriltovena merged commit bce4470 into grafana:master May 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recover from panic in http and grpc handlers. #2059

Recover from panic in http and grpc handlers. #2059

cyriltovena commented May 8, 2020

codecov-io commented May 8, 2020

owen-d left a comment

owen-d May 11, 2020

cyriltovena commented May 11, 2020

slim-bean commented May 11, 2020

owen-d left a comment

Recover from panic in http and grpc handlers. #2059

Recover from panic in http and grpc handlers. #2059

Conversation

cyriltovena commented May 8, 2020

codecov-io commented May 8, 2020

Codecov Report

owen-d left a comment

Choose a reason for hiding this comment

owen-d May 11, 2020

Choose a reason for hiding this comment

cyriltovena commented May 11, 2020

slim-bean commented May 11, 2020

owen-d left a comment

Choose a reason for hiding this comment