store: crashing after upgrade to 0.3.0 #829

R4scal · 2019-02-10T07:02:08Z

Hi

thanos, version 0.3.0 (branch: HEAD, revision: 837e9671737698bf1778a4a9abfebbf96117a0be)
  build user:       root@986454de7a63
  build date:       20190208-15:23:51
  go version:       go1.11.5

I keep 24h of data in Prometheus and use Thanos for long-term storage. After upgrading Thanos to 0.3.0, any query over an interval longer than 24h crashes Thanos Store:

 level=debug ts=2019-02-10T06:54:56.452875653Z caller=bucket.go:653 msg="Blocks source resolutions" blocks=6 mint=1549608897000 maxt=1549781697000 lset="{environment=\"prod\",replica=\"A\",service=\"sys\"}" spans="Range: 1549497600000-1549771200000 Resolution: 0"
 panic: runtime error: slice bounds out of range
 goroutine 981 [running]:
 github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc05f438900, 0x127d1e0, 0xc003f29080, 0xc01ac84280, 0x4, 0x20, 0x0, 0x5f48c8305ecfa1d, 0x0, 0x0)
         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1573 +0x6d3
 github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0x4346e9, 0x11757b0)
         /go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1544 +0xb2
 github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc08bdd0c60, 0xc08bdd0b40, 0xc0686b08e0)
         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
 created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
         /go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xbe


GiedriusS · 2019-02-10T08:51:29Z

Did it happen pre-0.3.0? Are you sure you have enough RAM in that box to execute this query? If you execute sysctl vm.overcommit_memory=2 (disables overcommitting) and perform the same action - what happens (I assume you run on Linux)?

R4scal · 2019-02-10T12:09:24Z

I tried setting sysctl -w vm.overcommit_memory=2, with no success. I have a lot of free memory.
I also tried downgrading Thanos on the store node to 0.2.1 and it works fine, with no crashes on queries.

thomasriley · 2019-02-11T11:17:19Z

Also seeing the same panic with Thanos Store after upgrading to 0.3.0:

goroutine 978 [running]:
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc4218485a0, 0x11e87c0, 0xc4f9cbd280, 0xc45c5f6800, 0x2f, 0x100, 0x1, 0x1e9185281e8dc144, 0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1573 +0x6bf
github.com/improbable-eng/thanos/pkg/store.(*bucketChunkReader).preload.func3(0x0, 0x0)
	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1544 +0xab
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc43e7d8180, 0xc43e7d80c0, 0xc43f936510)
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
	/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

I can see that Store uses a fair amount of memory during bucket initialisation and then drops off to a more conservative usage. As you can see below it does not run out of memory at the moment it crashes (7.61GB / 20GB @ 11:08:30):

bwplotka · 2019-02-11T14:11:08Z

Hm.. the code path that is problematic looks exactly like this: #816

This means that we ask object storage for more bytes than the reader gives us. We probably need some check anyway (as mentioned in the discussion in the linked ticket). But the overall state looks like a malformed block. Why would the index point to non-existing bytes? Unless we have a bug in the posting code, which was touched recently.

Does this happen on a particular block or on all of them? How often?

R4scal · 2019-02-11T14:23:43Z

In my case all queries to remote storage crash on 0.3.0. I downgraded only the store node to 0.2.1 and everything is working fine now.

PsychoSid · 2019-02-11T14:25:44Z

I have also downgraded my storage processes to 0.2.1 and all is fine now (I have left query and compactor running on 0.3.0).

domgreen · 2019-02-11T15:23:09Z

Is the block where this is happening a partially uploaded block?

From the code, this should only happen if we are trying to get data that we expect would be in the block but has not been written.

R4scal · 2019-02-11T17:09:35Z

No, it's not. I don't have partially uploaded blocks (at least according to the sidecar and compact logs). But in one S3 bucket (a local minio cluster) I have blocks from multiple Prometheus instances with different tags (replica, dc, service), and store queries to all of them fail on Thanos 0.3.0 but work fine on version 0.2.1.
I downgraded only the store service to 0.2.1. Sidecar, query and compact are running on 0.3.0.

PsychoSid · 2019-02-11T17:38:43Z

What @R4scal says almost exactly mirrors my issue as well. Although I had 2 separate environments (different buckets etc) and 0.3.0 failed store queries so I downgraded that to 0.2.1 and all ok. Everything else is at 0.3.0. I can switch versions at will really if anything needs testing.

bwplotka · 2019-02-11T19:37:34Z

Yes, it would be nice to know whether a particular block is wrong or all of them. If 0.2.1 works, then it seems it has to be something with the tsdb upgrade and the posting refactor we had ):


PsychoSid · 2019-02-11T20:04:55Z

Doesn't seem tied to a particular block, but it's hard to say for sure. Is there anything I can run to pinpoint it? Running it in debug shows a bunch but nothing to indicate a problem with any of them.

bwplotka · 2019-02-12T02:07:12Z

I think this is related to this change: #753

bwplotka · 2019-02-12T03:44:41Z

Important question. What queries are you doing exactly?

R4scal · 2019-02-12T07:01:16Z

Example queries that crash thanos 0.3.0 from grafana in my case:

telegraf_internal_gather_metrics_gathered{input="disk",environment="$env",ms="$ms",service="mon"}
rate(net_bytes_recv{ms="$ms",host=~"$host", interface=~"bond[0-9]+$",environment="$env",service="sys"}[$inter])*8


bwplotka · 2019-02-13T04:45:14Z

Thanks for all the info! Within just a couple of days of the release we found (thanks to you guys) and hopefully fixed this: #837

(: We need to fix some issues with the negative matcher and then we will do a patch release to include this.

* setting the start and end to prior posting changes
* really need some tests data but this may also be the fix
* moving the start and end inside the loop, so they are not updated as we iterate over items
* Added regressions tests for #829. Moved bucket e2e tests to table test. Signed-off-by: Bartek Plotka <[email protected]>
* Fixed overestimation for fetching chunks and series. Signed-off-by: Bartek Plotka <[email protected]>
* Removed wrong comment. Signed-off-by: Bartek Plotka <[email protected]>
* changing func to match interface

bwplotka · 2019-02-14T13:13:19Z

Fixed by this: #837 (:

GiedriusS mentioned this issue Feb 10, 2019

Compactor Retention Issue and Resource Usage #824

Closed

bwplotka added the bug label Feb 11, 2019

bwplotka mentioned this issue Feb 12, 2019

store: Querying with regex label matchers return invalid metrics in version 0.3.0 #833

Closed

bwplotka added a commit that referenced this issue Feb 13, 2019

Added regressions tests for #829.

9c31327

Moved bucket e2e tests to table test. Signed-off-by: Bartek Plotka <[email protected]>

bwplotka mentioned this issue Feb 13, 2019

store: Setting the start and end to prior posting changes #837

Merged

bwplotka closed this as completed Feb 14, 2019

mreichardt95 mentioned this issue Mar 4, 2019

Query node hitting storage node causes storage node to panic #878

Closed
