Thanos Compact Crashing due to Azure Timeout #3952

Closed
airkewld opened this issue Mar 23, 2021 · 8 comments

@airkewld

Compact component crashes when it encounters a file it cannot fully download/analyze.

level=info ts=2021-03-23T14:01:38.498832601Z caller=http.go:84 service=http/server component=compact msg="internal server is shutdown gracefully" err="syncing metas: incomplete view: meta.json file exists: 01F0DSQC0MQ07MBCFXSWD62NCG/meta.json: cannot get properties for Azure blob, address: 01F0DSQC0MQ07MBCFXSWD62NCG/meta.json: -> github.com/Azure/azure-pipeline-go/pipeline.NewError, /go/pkg/mod/github.com/!azure/[email protected]/pipeline/error.go:154\nHTTP request failed\n\nHead \"https://obj_store_name.blob.core.windows.net/thanos/01F0DSQC0MQ07MBCFXSWD62NCG/meta.json?timeout=1\": context deadline exceeded\n"

Our object store config sets max_retries to 3, but the value does not seem to be passed through correctly.

/ # cat /proc/1/cmdline
/bin/thanoscompact--wait--log.level=debug--compact.concurrency=1--log.format=logfmt--objstore.config=type: AZURE
config:
  storage_account: "storage_account"
  storage_account_key: "storage_account_key"
  container: "container_name"
  endpoint: "blob.core.windows.net"
  max_retries: 3
--data-dir=/var/thanos/compact--debug.accept-malformed-index--retention.resolution-raw=30d--retention.resolution-5m=90d--retention.resolution-1h=180d--delete-delay=48h--deduplication.replica-label=prometheus_replica--deduplication.replica-label=rule_replica--tracing.config="config":
  "sampler_param": 2
  "sampler_type": "ratelimiting"
  "service_name": "thanos-compact"
"type": "JAEGER"/ #

Any assistance is appreciated.

@wiardvanrij
Member

I'm pretty sure we need to add a timeout value to the retry pipeline options here:

retryOptions := blob.RetryOptions{

See: https://pkg.go.dev/github.com/Azure/azure-storage-blob-go/azblob#RetryOptions

// TryTimeout indicates the maximum time allowed for any single try of an HTTP request.
// A value of zero means that you accept our default timeout. NOTE: When transferring large amounts
// of data, the default TryTimeout will probably not be sufficient. You should override this value
// based on the bandwidth available to the host machine and proximity to the Storage service. A good
// starting point may be something like (60 seconds per MB of anticipated-payload-size).
TryTimeout time.Duration

I will try to make a PR soon™
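
To illustrate why a retry count alone may not help here, below is a minimal standard-library sketch of a retry loop in which every attempt gets its own deadline. It is not the azblob implementation: `retryOptions`, `doWithRetry`, `slowOp` and `attemptsFor` are hypothetical names invented for this example, and the struct only mirrors the two knobs discussed above (number of tries and per-try timeout).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retryOptions is a hypothetical, trimmed-down stand-in for the retry
// settings discussed in this thread. It is NOT azblob.RetryOptions.
type retryOptions struct {
	MaxTries   int           // total attempts, including the first one
	TryTimeout time.Duration // deadline applied to each individual attempt
}

// doWithRetry runs op up to MaxTries times and gives every attempt its own
// context deadline of TryTimeout. It returns the number of attempts made.
func doWithRetry(ctx context.Context, o retryOptions, op func(context.Context) error) (int, error) {
	if o.MaxTries < 1 {
		o.MaxTries = 1 // always try at least once
	}
	var err error
	for try := 1; try <= o.MaxTries; try++ {
		tryCtx, cancel := context.WithTimeout(ctx, o.TryTimeout)
		err = op(tryCtx)
		cancel()
		if err == nil {
			return try, nil
		}
		if ctx.Err() != nil {
			// The caller's context is done, so further tries are pointless.
			return try, ctx.Err()
		}
	}
	return o.MaxTries, err
}

// slowOp simulates a request that needs `latency` to complete and aborts
// when the per-try context expires first.
func slowOp(latency time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(latency):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// attemptsFor wraps doWithRetry so it can be tried out with plain
// millisecond values. It reports the attempts made and whether op succeeded.
func attemptsFor(maxTries, tryTimeoutMs, latencyMs int) (int, bool) {
	o := retryOptions{
		MaxTries:   maxTries,
		TryTimeout: time.Duration(tryTimeoutMs) * time.Millisecond,
	}
	n, err := doWithRetry(context.Background(), o, slowOp(time.Duration(latencyMs)*time.Millisecond))
	return n, err == nil
}

func main() {
	// Every try times out before the request can finish: all three tries fail.
	fmt.Println(attemptsFor(3, 20, 200)) // prints "3 false"

	// With a generous per-try timeout the first try already succeeds.
	fmt.Println(attemptsFor(3, 500, 20)) // prints "1 true"
}
```

In this sketch, raising `MaxTries` only repeats the same failure when `TryTimeout` is shorter than the request needs, which is why exposing a per-try timeout seems like the relevant fix.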

@airkewld
Author

Re: "but it does not seem to be passed in correctly"
In the error log I see the timeout set to 1. I expected that timeout value to match the max_retries specified in the object store config file:
SQC0MQ07MBCFXSWD62NCG/meta.json?timeout=1\":
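
For anyone reading along: as far as I understand the Azure Storage REST API, the `timeout` query parameter is a per-request timeout expressed in seconds, not a retry count, so it would not be expected to mirror max_retries. A small sketch for pulling that parameter out of the logged URL, where `timeoutParam` is a hypothetical helper written for this example:

```go
package main

import (
	"fmt"
	"net/url"
)

// timeoutParam returns the value of the "timeout" query parameter of rawURL,
// or an empty string when the URL cannot be parsed or the parameter is absent.
func timeoutParam(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	return u.Query().Get("timeout")
}

func main() {
	// URL copied from the error log at the top of this issue.
	logged := "https://obj_store_name.blob.core.windows.net/thanos/01F0DSQC0MQ07MBCFXSWD62NCG/meta.json?timeout=1"
	fmt.Println(timeoutParam(logged)) // prints "1"
}
```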

@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jun 2, 2021
@wiardvanrij
Member

Still working on this via PR

@stale stale bot removed the stale label Jun 3, 2021
@stale

stale bot commented Aug 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Aug 2, 2021
@stale

stale bot commented Aug 17, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Aug 17, 2021
@phoenixking25

Any update here?

@wiardvanrij
Member

It's implemented via #3970
