store/bucket: merging posting groups on 0.3.2-rc.0 segfault #874

Closed
GiedriusS opened this issue Feb 28, 2019 · 16 comments

@GiedriusS
Member

GiedriusS commented Feb 28, 2019

Sometimes I get a segmentation fault. Seems related to the new posting group merging logic.

Thanos, Prometheus and Golang version used

improbable/thanos:v0.3.2-rc.0

What happened

Thanos Store crashed not long after starting up.

What you expected to happen

Thanos Store to work.

How to reproduce it (as minimally and precisely as possible):

Unfortunately I don't have any reproducer. Perhaps something is visible from the stack trace.

Full logs to relevant components

Logs


level=debug ts=2019-02-28T09:32:11.084257964Z caller=bucket.go:660 msg="Blocks source resolutions" blocks=1 mint=1551338820000 maxt=1551346320000 lset="{monitor=\"prt
level=debug ts=2019-02-28T09:32:12.180438609Z caller=bucket.go:793 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:20 postingsTouchedSizeSum:141
level=debug ts=2019-02-28T09:32:13.467076953Z caller=bucket.go:793 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:10 postingsTouchedSizeSum:125
level=debug ts=2019-02-28T09:32:13.483760212Z caller=bucket.go:793 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:10 postingsTouchedSizeSum:125
level=debug ts=2019-02-28T09:32:13.762218102Z caller=bucket.go:793 msg="series query processed" stats="&{blocksQueried:4 postingsTouched:20 postingsTouchedSizeSum:142
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x580e27]
goroutine 2851 [running]:
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.Merge(0xc44a3e4d00, 0x2, 0x2, 0x20, 0xc477dfa480)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:375 +0x127
github.com/improbable-eng/thanos/pkg/store.merge(0xc44a3e4d00, 0x2, 0x2, 0xc477dfa480, 0xc5378ff340)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1246 +0x3f
github.com/improbable-eng/thanos/pkg/store.(*postingGroup).Postings(0xc4316cf000, 0xc4e45c9520, 0x2)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1242 +0x8c
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).ExpandedPostings(0xc4316a04b0, 0xc44ab65f40, 0x5, 0x5, 0xbccaca661ee95d3e, 0xd82aef30d19f972c, 0xab0ed
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1199 +0x200
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc4204e2090, 0x1203ce0, 0xc4316ce380, 0xda789a829d316901, 0x3f5ad73b1516dffa, 0xc4267b2480, 0xc
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:484 +0x8c
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x433578, 0x1163da8)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:706 +0xe6
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc420f2b620, 0xc4316b0690, 0xc44a3bb660)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

Logs

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x580bc9]
goroutine 153969 [running]:
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc55b47c0c0, 0x2354, 0xc55b47c0c0)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:348 +0x29
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc55b47c0f0, 0x2354, 0x1050040)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:323 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc55b47c0f0, 0x5808ec)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:344 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x12047a0, 0xc55b47c0f0, 0x4, 0x12047a0, 0xc55b47c0f0, 0xc55c145a40, 0x2)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:250 +0x57
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).ExpandedPostings(0xc4646d8ff0, 0xc482c509f0, 0x3, 0x3, 0xfe11a8ab1c95568d, 0x4fd305f69ea32d8a, 0xff00db977420f744, 0x56a1fcc2250b3073, 0x0)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1202 +0x2e3
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc420456000, 0x1203ce0, 0xc45ec2f180, 0x9b6e30349a296901, 0x47b2c82f01b614f8, 0xc4203f43f0, 0xc4646d8ff0, 0xc45ec997a0, 0xc482c509f0, 0x3, ...)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:484 +0x8c
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0xc47b3a2000, 0x0)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:706 +0xe6
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc5438b0480, 0xc428c1c7e0, 0xc5412b7540)
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/home/bartek/Repos/thanosGo/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

@bwplotka
Member

bwplotka commented Mar 5, 2019

So essentially it was a one-off thing? Not reproducible anymore? ):

@GiedriusS
Member Author

🤔 I only tried it for a little bit, and because it came up pretty fast I thought I should file this issue so that other people who run into it can come and discuss it. However, that doesn't seem to be the case, so I will revisit this once we upgrade to 0.3.2. I will leave it open for now just in case. It also seems fishy to me that the stack trace contains /home/bartek/..., which probably means that you compiled this on your laptop, so perhaps this Docker image doesn't have the newest regression fixes? Either way, I'm going to come back to this in a bit and will close it if it's not reproducible anymore.

@bwplotka
Member

bwplotka commented Mar 5, 2019

Haha, interesting. Yeah, RC images are built on my laptop, mostly for demo (: They come from the newest master at that point, which should include all the fixes.

This panic looks like something we fixed here: cb38508, but this tag should already include that fix, IMO (we can double-check).

Anyway, there must be a certain block and a certain query that triggered it. Do you remember what query you used? I cannot see anything trivial here, unless the block is malformed on your disk.

@bwplotka
Member

bwplotka commented Mar 5, 2019

fun fact. We have:

grpcPanicRecoveryHandler := func(p interface{}) (err error) {
	panicsTotal.Inc()
	level.Error(logger).Log("msg", "recovered from panic", "panic", p, "stack", debug.Stack())
	return status.Errorf(codes.Internal, "%s", p)
}

But somehow it still crashes the container 0.0

@bwplotka
Member

bwplotka commented Mar 5, 2019

I think the panics are missed because they are triggered in different goroutines.
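
To illustrate the point about goroutines: in Go, `recover` only stops a panic when it is called from a deferred function in the same goroutine that panicked. A handler wrapped around the gRPC call therefore cannot catch a panic raised in a goroutine spawned further down, and the traces in this thread end in `created by .../oklog/run.(*Group).Run`. Below is a minimal sketch of one way to contain such panics. The helpers `safeGo`, `runAll` and `deref` are hypothetical names for illustration and are not part of Thanos or oklog/run.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// deref dereferences p and panics with a nil pointer dereference when p is nil.
// Hypothetical helper used only to trigger a panic.
func deref(p *int) int { return *p }

// safeGo runs fn in a new goroutine. A panic inside fn is recovered in that
// same goroutine and reported on errs instead of crashing the process.
// Hypothetical helper, not an existing Thanos or oklog/run API.
func safeGo(wg *sync.WaitGroup, errs chan<- error, fn func() error) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			if p := recover(); p != nil {
				errs <- fmt.Errorf("recovered from panic: %v", p)
			}
		}()
		if err := fn(); err != nil {
			errs <- err
		}
	}()
}

// runAll runs every fn concurrently and collects the errors, including
// recovered panics. Each fn reports at most one error, so the buffer
// of len(fns) never blocks a sender.
func runAll(fns ...func() error) []error {
	var wg sync.WaitGroup
	errs := make(chan error, len(fns))
	for _, fn := range fns {
		safeGo(&wg, errs, fn)
	}
	wg.Wait()
	close(errs)

	var out []error
	for err := range errs {
		out = append(out, err)
	}
	return out
}

func main() {
	errs := runAll(
		func() error { return nil },
		func() error { _ = deref(nil); return nil }, // panics inside its own goroutine
		func() error { return errors.New("plain error") },
	)
	for _, err := range errs {
		fmt.Println(err)
	}
}
```

Without the deferred `recover` inside the spawned goroutine, the nil dereference would terminate the whole process regardless of any recovery handler installed by the caller.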

@hsmade
Contributor

hsmade commented Mar 11, 2019

We see this same issue happening almost daily. Using 0.3.1

@bwplotka
Member

What query triggers it?

@hsmade
Contributor

hsmade commented Mar 11, 2019

I'll do some testing

@bwplotka
Member

Plus, moving to v0.3.2 would be nice (:

@hsmade
Contributor

hsmade commented Mar 11, 2019

query: api/datasources/proxy/18/api/v1/query_range?query=sum(irate(rt%3Aserver%3Arequests%5B2m%5D))&start=1551096000&end=1552316400&step=10800

debug log attached
thanos-store.stderr.json.txt

I actually just upgraded to 0.3.1 when I found out about 0.3.2 :P
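
For readability, the percent-encoded `query` parameter in the request above can be decoded, and the range it covers worked out from `start`, `end` and `step`. A small sketch using only the Go standard library; the helper names `decodeQuery` and `stepCount` are made up for this example.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodeQuery percent-decodes a single URL query value.
func decodeQuery(raw string) (string, error) {
	return url.QueryUnescape(raw)
}

// stepCount returns how many whole steps fit between start and end (seconds).
func stepCount(start, end, step int64) int64 {
	return (end - start) / step
}

func main() {
	q, err := decodeQuery("sum(irate(rt%3Aserver%3Arequests%5B2m%5D))")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(q) // sum(irate(rt:server:requests[2m]))

	// Values taken from the query_range request in this comment.
	const start, end, step = 1551096000, 1552316400, 10800
	fmt.Println(stepCount(start, end, step))    // 113
	fmt.Println(float64(end-start) / 86400.0)   // 14.125 (days)
}
```

So the request is a range query over roughly two weeks at a 3-hour step, which means the store has to touch every block overlapping that window.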

@hsmade
Contributor

hsmade commented Mar 11, 2019

I upgraded to 0.3.2 and the panic is gone.
Now I get a proper error:

{
  "status": "error",
  "errorType": "timeout",
  "error": "query timed out in expression evaluation",
  "message": "query timed out in expression evaluation"
}

@hsmade
Contributor

hsmade commented Mar 12, 2019

And the panic is back again:

[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x58070a]

goroutine 31149 [running]:
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*removedPostings).Seek(0xc463f143f0, 0x163403, 0xc42056c800)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:454 +0x4a
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Seek(0xc463f14420, 0x163403, 0xc463f14420)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:348 +0x3d
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).doNext(0xc463f14450, 0x163403, 0x1050c60)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:323 +0x47
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc463f14450, 0x57fbac)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:344 +0x60
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.ExpandPostings(0x1205820, 0xc463f14450, 0x4, 0x1205820, 0xc463f14450, 0xc442163c80, 0x2)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:250 +0x57
github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).ExpandedPostings(0xc460c87630, 0xc460c79290, 0x3, 0x3, 0xc459368f40, 0xc4200d1401, 0xc42f899826, 0xc44b130ab0, 0x9de339)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1202 +0x2e3
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc42041c120, 0x1204d60, 0xc43fa36800, 0x4306ed9f406f6901, 0x6c912cf55a707f21, 0xc431bc0510, 0xc460c87630, 0xc460caaa20, 0xc460c79290, 0x3, ...)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:484 +0x8c
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x432ea8, 0x1164e70)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:706 +0xe6
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc460cc78c0, 0xc4608f8b60, 0xc460cae090)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

@bwplotka
Member

bwplotka commented Mar 12, 2019

We can repro it internally as well

EDIT: Now it's gone. I think it's tied to a particular query & block. We will keep trying.

@bwplotka
Member

The problem is rather here:

https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1229:29

and particularly here: https://github.com/improbable-eng/thanos/blob/master/pkg/store/bucket.go#L1357:29

The issue is that the test cases are there (we might be missing some?), so the bug is not really clear. It might indicate something similar to #335: a race condition (nothing obvious) or a hidden OOM (failed alloc). We will keep digging.

@bwplotka
Member

bwplotka commented Apr 2, 2019

Status:

Which again, either suggests:

  • OOM (broken allocations?)
  • Malformed data from bucket -> I think we can do a further PR to test against those!
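
The traces in this thread show a nil pointer dereference inside `index.Merge` and the `Seek` methods, which would be consistent with a nil postings value reaching that code. As a hedged sketch of the kind of defensive check and test hinted at above, the snippet below uses a stand-in `Postings` interface and hypothetical helpers (`listPostings`, `expandAll`). It is not the Thanos or prometheus/tsdb implementation, only an illustration of turning a would-be panic into an error.

```go
package main

import "fmt"

// Postings is a stand-in for an iterator over series IDs.
// Assumption: simplified for illustration, not the real tsdb interface.
type Postings interface {
	Next() bool
	At() uint64
}

// listPostings iterates over an in-memory slice of IDs.
type listPostings struct {
	ids []uint64
	cur int // number of elements already consumed
}

func (l *listPostings) Next() bool {
	if l.cur >= len(l.ids) {
		return false
	}
	l.cur++
	return true
}

func (l *listPostings) At() uint64 { return l.ids[l.cur-1] }

// expandAll drains every iterator into one slice. It rejects nil entries up
// front, because calling Next on a nil interface value panics with a nil
// pointer dereference. Note that p == nil does not detect an interface that
// wraps a nil pointer; that case needs a separate check.
func expandAll(ps []Postings) ([]uint64, error) {
	for i, p := range ps {
		if p == nil {
			return nil, fmt.Errorf("postings at index %d is nil", i)
		}
	}
	var out []uint64
	for _, p := range ps {
		for p.Next() {
			out = append(out, p.At())
		}
	}
	return out, nil
}

func main() {
	good := []Postings{
		&listPostings{ids: []uint64{1, 2}},
		&listPostings{ids: []uint64{3}},
	}
	ids, err := expandAll(good)
	fmt.Println(ids, err)

	bad := []Postings{&listPostings{ids: []uint64{1}}, nil}
	_, err = expandAll(bad)
	fmt.Println(err)
}
```

A test for malformed bucket data could feed deliberately incomplete inputs like `bad` through the real code path and assert that an error comes back rather than a panic.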

@GiedriusS
Member Author

My 0.5.0 Thanos Store has been up for ~4 weeks now on prod, so I think this can be closed 😄
