Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thanos Compact gets internal server error, when is facing context canceled from Storage Provider #4689

Closed
nicolastakashi opened this issue Sep 21, 2021 · 10 comments

Comments

@nicolastakashi
Copy link
Contributor

nicolastakashi commented Sep 21, 2021

Thanos, Prometheus and Golang version used:

Thanos Version 0.22
Prometheus: 2.27

Object Storage Provider: Microsoft Azure

What happened: Thanos Compact gets internal server error, when is facing context canceled from Storage Provider

What you expected to happen: Retry an keep health

How to reproduce it (as minimally and precisely as possible): There is no a specific scenario, I just start to face this error after update to version 0.22

Full logs to relevant components:

level=info ts=2021-09-21T13:37:30.172907505Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-09-21T13:37:30.679158689Z caller=compact.go:556 msg="starting compact node"
level=info ts=2021-09-21T13:37:30.679202292Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2021-09-21T13:37:30.679450812Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2021-09-21T13:37:30.679477414Z caller=http.go:63 service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2021-09-21T13:37:30.679994754Z caller=tls_config.go:191 service=http/server component=compact msg="TLS is disabled." http2=false
level=info ts=2021-09-21T13:37:30.680282777Z caller=compact.go:948 msg="start sync of metas"
level=info ts=2021-09-21T13:40:49.566485486Z caller=main.go:167 msg="caught signal. Exiting." signal=terminated
level=warn ts=2021-09-21T13:40:49.56655559Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2021-09-21T13:40:49.566569191Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err=null
level=info ts=2021-09-21T13:40:50.065939201Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
level=info ts=2021-09-21T13:40:50.065987104Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
level=error ts=2021-09-21T13:40:50.27341152Z caller=compact.go:488 msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context canceled"
level=error ts=2021-09-21T13:40:50.273894348Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: cannot list hierarchy blobs with prefix  (iteration #1): Get \"https://storage.blob.core.windows.net/container?comp=list&delimiter=%2F&restype=container&timeout=61\": context canceled"
level=info ts=2021-09-21T13:40:50.273937051Z caller=main.go:159 msg=exiting

Anything else we need to know:

@Aaron-ML
Copy link

I'm seeing something similar to this on 0.22 as well

level=info ts=2021-09-14T00:42:00.064565662Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=805.474531ms cached=1284 returned=1284 partial=0
level=warn ts=2021-09-14T00:42:03.017083892Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2021-09-14T00:42:03.017118393Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err=null
level=error ts=2021-09-14T00:42:03.017273796Z caller=compact.go:488 msg="retriable error" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json: cannot get blob reader: 01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json: cannot get properties for Azure blob, address: 01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json: Head \"https://<redacted>.blob.core.windows.net/thanos/01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json?timeout=61\": dial tcp 52.241.88.36:443: connect: cannot assign requested address"
level=info ts=2021-09-14T00:42:03.019124632Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
level=info ts=2021-09-14T00:42:03.019148232Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
level=info ts=2021-09-14T00:42:03.019185833Z caller=main.go:159 msg=exiting

This doesn't happen on 0.21, very odd.

@hanjm
Copy link
Member

hanjm commented Sep 26, 2021

By default thanos compact will run to completion which makes it possible to execute in a cronjob. Using the arguments --wait and --wait-interval=5m it’s possible to keep it running.

https://thanos.io/tip/components/compact.md/#compactor

@Aaron-ML
Copy link

Aaron-ML commented Sep 27, 2021

By default thanos compact will run to completion which makes it possible to execute in a cronjob. Using the arguments --wait and --wait-interval=5m it’s possible to keep it running.

https://thanos.io/tip/components/compact.md/#compactor

@hanjm I'm already running the --wait flag and have for quite awhile, not sure if that changed in 0.22? This isn't running as a cronjob but rather a deployment.

  - args:
    - compact
    - --log.level=info
    - --log.format=logfmt
    - --http-address=0.0.0.0:10902
    - --objstore.config-file=/etc/config/object-store.yaml
    - --data-dir=/var/thanos/compact
    - --consistency-delay=30m
    - --retention.resolution-raw=30d
    - --retention.resolution-5m=120d
    - --retention.resolution-1h=1y
    - --block-sync-concurrency=20
    - --compact.concurrency=1
    - --wait

@nicolastakashi
Copy link
Contributor Author

I'm using the same strategy that @Aaron-ML is using and even on version 0.20.1, I'm facing errors that could be handled by retrying the operation.

level=info ts=2021-09-30T07:02:29.072136538Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-09-30T07:02:29.56787006Z caller=compact.go:539 msg="starting compact node"
level=info ts=2021-09-30T07:02:29.567916963Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2021-09-30T07:02:29.568176778Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2021-09-30T07:02:29.568213681Z caller=http.go:62 service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2021-09-30T07:02:29.568552701Z caller=compact.go:942 msg="start sync of metas"
level=info ts=2021-09-30T07:02:39.183579477Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=9.614692056s cached=348 returned=348 partial=0
level=info ts=2021-09-30T07:02:57.97161771Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=18.698576374s cached=348 returned=348 partial=0
level=info ts=2021-09-30T07:03:11.869691052Z caller=main.go:167 msg="caught signal. Exiting." signal=terminated
level=warn ts=2021-09-30T07:03:11.869753456Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2021-09-30T07:03:11.869770557Z caller=http.go:69 service=http/server component=compact msg="internal server is shutting down" err=null
level=info ts=2021-09-30T07:03:11.874936563Z caller=http.go:88 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
level=info ts=2021-09-30T07:03:11.874965165Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
level=error ts=2021-09-30T07:03:12.368622077Z caller=compact.go:468 msg="retriable error" err="compaction: sync: filter metas: filter blocks marked for no compaction: get file: 01G4E4NFP77YFEJES4E63TB07S/no-compact-mark.json: cannot get blob reader: 01F4E3NCP77YFEJES3E63TF07S/no-compact-mark.json: cannot get properties for Azure blob, address: 01F4E3NCP77YFEJES4E63TB07S/no-compact-mark.json: -> github.com/Azure/azure-pipeline-go/pipeline.NewError, /home/circleci/go/pkg/mod/github.com/!azure/[email protected]/pipeline/error.go:154\nHTTP request failed\n\nHead \"https://<storage>.blob.core.windows.net/container/01F4E3NCP77YFEJES4E63TB07S/no-compact-mark.json?timeout=61\": context canceled\n"
level=info ts=2021-09-30T07:03:12.369028801Z caller=main.go:159 msg=exiting

@ahurtaud
Copy link
Contributor

ahurtaud commented Oct 7, 2021

Can be a duplicate of #4466 which is still valid for me.
@nicolastakashi @Aaron-ML Can you try to run thanos tools bucket inspect ?
As per my comment on the other issue, it is working fine in 0.21.0 but failed since 0.22. Still valid in 0.23.1

@nicolastakashi
Copy link
Contributor Author

Hi @ahurtaud, sorry but I don't realize the relation of this issue with thanos tools bucket inspect, could please provide me a little more context?

@ahurtaud
Copy link
Contributor

@nicolastakashi I think some code is shared for listing buckets/meta.
I suspect this issue to be with the Azure API and I am seeing the same regression between 0.21 and 0.22+ for azure blob storage with thanos tools bucket inspect is an easy way to reproduce with the CLI.

So this might be the same root cause.

@nicolastakashi
Copy link
Contributor Author

Hey Peeps!
A quick update.
In the last week, I reviewed my pod resources requests and limits, and I've noticed many CPU throttling on the Thanos Compact Pod.

After I adjusted the resources requests and limits and Thanos Compact stopped being throttled, the Thanos Compact stopped receiving Timeout during Storage Communication.

@yeya24
Copy link
Contributor

yeya24 commented Oct 30, 2021

Is it good to close the issue now? @nicolastakashi

@nicolastakashi
Copy link
Contributor Author

Hi @yeya24 from my side it looks Ok!

@yeya24 yeya24 closed this as completed Nov 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants