Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed -querier.query-ingesters-within and -blocks-storage.tsdb.close-idle-tsdb-timeout defaults from 0 to 13h #967

Merged
merged 4 commits into from
Feb 1, 2022

Conversation

pracucci
Copy link
Collaborator

What this PR does:
In this PR I'm proposing to change the defaults of -querier.query-ingesters-within and -blocks-storage.tsdb.close-idle-tsdb-timeout both to 13h. I think this is a relatively safe change to do.

I've checked the docs/ and I think all mentions to -querier.query-ingesters-within and -blocks-storage.tsdb.close-idle-tsdb-timeout still make sense.

Out of scope of this PR:

  • I've intentionally not addressed -querier.query-store-after and -blocks-storage.bucket-store.ignore-blocks-within which I consider more sensitive and I would prefer to discuss separately.

Which issue(s) this PR fixes:
Fixes https://github.com/grafana/mimir-squad/issues/497

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@krajorama krajorama self-requested a review January 31, 2022 13:40
@krajorama
Copy link
Contributor

13h comes from ingester keeping data for 12h IIRC, so I'd mention that to show the dependency, or actually use some common constant, otherwise it's hard to understand why we chose exactly 13.

@pracucci
Copy link
Collaborator Author

13h comes from ingester keeping data for 12h IIRC, so I'd mention that to show the dependency, or actually use some common constant, otherwise it's hard to understand why we chose exactly 13.

We're used to keep blocks for a longer period on ingesters (in our jsonnet config), but your comment made me realize that the default is 6h, so we need to increase that too.

@pracucci pracucci marked this pull request as draft January 31, 2022 15:26
@pracucci pracucci marked this pull request as ready for review January 31, 2022 15:39
@pracucci pracucci force-pushed the change-distributor-remote-timeout-default-fixed branch from af58ee8 to 08cb356 Compare January 31, 2022 15:41
@krajorama
Copy link
Contributor

13h comes from ingester keeping data for 12h IIRC, so I'd mention that to show the dependency, or actually use some common constant, otherwise it's hard to understand why we chose exactly 13.

We're used to keep blocks for a longer period on ingesters (in our jsonnet config), but your comment made me realize that the default is 6h, so we need to increase that too.

So you're saying that 13h compares to the 24h retention time? Then that would mean that blocks-storage.tsdb.retention-period needs to be larger then the querier.query-ingesters-within ?

@@ -3172,6 +3172,7 @@ func TestIngester_closeAndDeleteUserTSDBIfIdle_shouldNotCloseTSDBIfShippingIsInP
cfg := defaultIngesterTestConfig(t)
cfg.LifecyclerConfig.JoinAfter = 0
cfg.BlocksStorageConfig.TSDB.ShipConcurrency = 2
cfg.BlocksStorageConfig.TSDB.CloseIdleTSDBTimeout = time.Nanosecond // We want it to be idle immediately.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want it to be idle immediately, why not set it to 0?
I think the answer to that question ^^ should probably be in the comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 means disabled, so that's why I've set it to 1 nanosecond. I will update the comment.

Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@@ -73,13 +73,13 @@ func (cfg *Config) RegisterFlags(f *flag.FlagSet, logger log.Logger) {
cfg.StoreGatewayClient.RegisterFlagsWithPrefix("querier.store-gateway-client", f)
f.BoolVar(&cfg.Iterators, "querier.iterators", false, "Use iterators to execute query, as opposed to fully materialising the series in memory.")
f.BoolVar(&cfg.BatchIterators, "querier.batch-iterators", true, "Use batch iterators to execute query, as opposed to fully materialising the series in memory. Takes precedent over the -querier.iterators flag.")
f.DurationVar(&cfg.QueryIngestersWithin, "querier.query-ingesters-within", 0, "Maximum lookback beyond which queries are not sent to ingester. 0 means all queries are sent to ingester.")
f.DurationVar(&cfg.QueryIngestersWithin, "querier.query-ingesters-within", 13*time.Hour, "Maximum lookback beyond which queries are not sent to ingester. 0 means all queries are sent to ingester.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this too close to "close on idle" time (also 13h)? Can we run into a race when ingesters decide to close TSDB at the same time when queriers query the block?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this too close to "close on idle" time (also 13h)?

That's how we set our jsonnet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we also set -querier.query-store-after to 12h.

@pracucci
Copy link
Collaborator Author

pracucci commented Feb 1, 2022

So you're saying that 13h compares to the 24h retention time? Then that would mean that blocks-storage.tsdb.retention-period needs to be larger then the querier.query-ingesters-within ?

-blocks-storage.tsdb.retention-period is not strictly required to be >= -querier.query-ingesters-within even if doesn't make much sense keep querying ingesters for data that know for sure won't be there. However, it's important that -blocks-storage.tsdb.retention-period and -querier.query-ingesters-within are either disabled or >= -querier.query-store-after, which is what we have in the config description:

  # TSDB blocks retention in the ingester before a block is removed. This should
  # be larger than the -blocks-storage.tsdb.block-ranges-period,
  # -querier.query-store-after and large enough to give store-gateways and
  # queriers enough time to discover newly uploaded blocks.
  # CLI flag: -blocks-storage.tsdb.retention-period
  [retention_period: <duration> | default = 24h]

@pracucci pracucci force-pushed the change-distributor-remote-timeout-default-fixed branch from 08cb356 to 1324d33 Compare February 1, 2022 08:52
…e-idle-tsdb-timeout defaults from 0 to 13h

Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
Signed-off-by: Marco Pracucci <[email protected]>
@pracucci pracucci force-pushed the change-distributor-remote-timeout-default-fixed branch from d254e3b to 2bc1e64 Compare February 1, 2022 13:25
@pracucci
Copy link
Collaborator Author

pracucci commented Feb 1, 2022

13h comes from ingester keeping data for 12h IIRC, so I'd mention that to show the dependency, or actually use some common constant, otherwise it's hard to understand why we chose exactly 13.

We're used to keep blocks for a longer period on ingesters (in our jsonnet config), but your comment made me realize that the default is 6h, so we need to increase that too.

So you're saying that 13h compares to the 24h retention time? Then that would mean that blocks-storage.tsdb.retention-period needs to be larger then the querier.query-ingesters-within ?

Discussed offline with Krajo. He suggests to improve the documentation to mention that our defaults are based on usage patterns seen in real life. I will create an issue about it because I think the discussion involves whether we should change -querier.query-store-after too.

@pracucci pracucci enabled auto-merge (squash) February 1, 2022 13:28
@pracucci pracucci merged commit 3da17b6 into main Feb 1, 2022
@pracucci pracucci deleted the change-distributor-remote-timeout-default-fixed branch February 1, 2022 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants