Support out of order samples ingestion #4964
Conversation
Several open questions:
We can follow up in next PRs if needed. Updated: we decided to keep 2 and we already implemented reloadable for 3.
# [EXPERIMENTAL] Configures the maximum capacity for out-of-order chunks (in
# samples). If set to <=0, default value 32 is assumed.
# CLI flag: -blocks-storage.tsdb.out-of-order-cap-max
[out_of_order_cap_max: <int> | default = 32]
It feels weird to me. Shall we extract the TSDB configs from the block storage section? Having them in the querier and store gateway sections is strange.
I think the querier component doesn't care about this field. Is this due to the use of the automatic documentation template?
Yeah, I think the querier and store gateway use some configs from block storage, but the TSDB configs shouldn't be related.
If out-of-order is enabled, users should not have to care about the configuration here (the configuration burden is too heavy), unless there is a maximum capacity limit (with the default being the largest).
If you are talking about the OOO capacity, I don't think this change would introduce additional burden for them. This is a global value, not per tenant.
1. OOO will affect the ingester's performance; it seems this should be per tenant?
The OOO time window is also per tenant, so using this configuration is almost the same. The only difference is that the distributor can drop samples early to avoid affecting ingesters.
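For illustration, a minimal sketch of a per-tenant window set through the runtime overrides file (assuming the standard Cortex `overrides:` layout; the tenant ID and value are illustrative):

```yaml
# Runtime overrides: only tenant-a opts into out-of-order ingestion with a
# 1h window; other tenants keep the default of 0 (OOO disabled).
overrides:
  tenant-a:
    out_of_order_time_window: 1h
```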
If there is no additional burden, there is absolutely no need to add more restrictions, right?
docs/blocks-storage/querier.md
@@ -878,4 +878,9 @@ blocks_storage:
# will be stored. 0 or less means disabled.
# CLI flag: -blocks-storage.tsdb.max-exemplars
[max_exemplars: <int> | default = 0]

# [EXPERIMENTAL] Configures the maximum capacity for out-of-order chunks (in
I suggest the following reword:
Configures the maximum number of samples that can be out-of-order. See [some link] on how out of order works.
I think we somehow need to link to a document that talks about https://docs.google.com/document/d/1Kppm7qL9C-BJB1j6yb6-9ObG3AbdZnFUBYPNNWwDBYM/edit and prometheus/prometheus#11075 because of the experimental feature.
I am thinking we should have a doc on cortexmetrics.io about OOO support talking about some operational implications. WDYT? I am more than happy to work on the documentation for OOO support.
The documentation is non-blocking for this PR, just bringing this up as a point of discussion.
Also, if set to <=0, default value 32 is assumed.
Is this what Prometheus does? My preference is not to do this. I would prefer my application to fail to start if I configure some nonsense value, because it's more explicit.
Configures the maximum number of samples that can be out-of-order.
This looks good. I think I will also mention that this is per chunk.
Yeah I love the idea of having a doc about this feature for sure.
And I added a validation to make sure it is > 0.
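For reference, a minimal sketch of where this setting sits, based on the snippet quoted above (the value is illustrative and, per the validation mentioned here, must be > 0):

```yaml
blocks_storage:
  tsdb:
    dir: /data/tsdb
    # [EXPERIMENTAL] Maximum number of samples an out-of-order chunk can hold.
    out_of_order_cap_max: 64
```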
@@ -2587,14 +2587,6 @@ The `limits_config` configures default and per-tenant limits imposed by Cortex s
# CLI flag: -validation.max-metadata-length
[max_metadata_length: <int> | default = 1024]

# Reject old samples.
# CLI flag: -validation.reject-old-samples
[reject_old_samples: <boolean> | default = false]
Per https://cortexmetrics.io/docs/configuration/v1guarantees/#flags-config-and-minor-version-upgrades we might need to keep the reject_old-* config for 2 minor releases.
I see. I will add them back.
I am wondering about the behaviour of the co-existence of `reject_old_samples` and `out_of_order_time_window`. By default `reject_old_samples=false` but `out_of_order_time_window=0`: one config says accept old (out-of-order) samples, the other says disable out-of-order samples. What should Cortex do?
Generally speaking, what happens if the two configs seemingly conflict with each other? It may be worth documenting the behaviour in v1-guarantees.md.
For the default values, if the sample is too old it will be rejected by the TSDB anyway if we don't enable OOO. This is the same even without OOO support, so we don't change behavior here.
Yeah, I could document the behavior. Basically I think `reject_old_samples` happens only on the distributor side; the OOO window is totally an ingester thing. So if users want OOO to work, they need to adjust their `reject_old_samples` configs to allow old samples to reach the ingester.
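A sketch of that interaction, assuming the existing `reject_old_samples` / `reject_old_samples_max_age` limits (values illustrative): if distributor-side rejection is enabled, its max age has to cover the OOO window, otherwise old samples are dropped before they ever reach the ingester.

```yaml
limits:
  # Distributor-side validation (existing flags).
  reject_old_samples: true
  # Must be at least as large as the OOO window below, or the distributor
  # drops the samples before the ingester can store them out of order.
  reject_old_samples_max_age: 12h

  # Ingester-side out-of-order window introduced by this PR.
  out_of_order_time_window: 1h
```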
Actually I feel that instead of adding this to v1-guarantees.md, it would be better to document it in the out-of-order samples operational doc you mentioned? WDYT? v1-guarantees.md seems to just list the flags we have, with nothing really detailed about the settings and usage.
I'm not sure if we should deprecate this flag...
Could I configure it so I accept out-of-order samples for 5 minutes BUT, if not out of order, accept samples that are 2 hours old? I know we want to simplify the config, but these seem like different things, no?
Could I configure it so I accept out-of-order samples for 5 minutes BUT, if not out of order, accept samples that are 2 hours old?
Isn't that doable with only OOO? Non-out-of-order samples are always accepted unless they are outside the head time range.
But I agree they are still slightly different things, since we can drop samples early on distributors rather than waiting until they reach ingesters.
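Assuming the same keys, a sketch of that scenario: in-order samples up to 2 hours old still pass distributor validation, while out-of-order samples are only ingested within a 5 minute window.

```yaml
limits:
  # Distributor: any sample older than 2h is rejected, in order or not.
  reject_old_samples: true
  reject_old_samples_max_age: 2h
  # Ingester: out-of-order samples are only accepted if they fall within
  # the last 5m relative to the newest sample in the head.
  out_of_order_time_window: 5m
```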
Discussed with @alanprot, we decided to not deprecate the two flags as they are different from the OOO settings.
We should update https://cortexmetrics.io/docs/configuration/v1guarantees/#experimental-features with the experimental flags as part of this PR :)
A few comments.
I also added #4990 so we remember to remove the deprecated flags in a later release.
Can out-of-order samples create bad results if a query is cached? After a few tests, we need a page to clarify expectations to users.
This is a follow-up in the query frontend to allow users to specify a non-cacheable time window.
When will this PR be merged? Very much looking forward to this new feature.
LGTM
Thanks! It's so clean and understandable.
Just one tiny nit.
cortex/pkg/ingester/ingester.go Line 1908 in a63bbb0
It may be necessary to consider handling repeated uploads of compacted blocks to S3. After enabling it, we found that blocks with a level greater than 1 were uploaded instead of being ignored.
Signed-off-by: Ben Ye <[email protected]>
What this PR does:
- Adds `out_of_order_time_window` to limits so that each tenant can configure their own OOO time window.
- Adds `out_of_order_cap_max` in the ingester configuration. This is not a per-tenant configuration.

Which issue(s) this PR fixes:
Fixes #4895
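Putting the two new options together, a minimal sketch of how they might look in a single Cortex config file (values illustrative; `out_of_order_cap_max` follows the blocks storage snippet quoted earlier):

```yaml
limits:
  # Per-tenant-overridable OOO window; the default of 0 keeps OOO disabled.
  out_of_order_time_window: 30m

blocks_storage:
  tsdb:
    # Global (not per-tenant) capacity of an out-of-order chunk, in samples.
    out_of_order_cap_max: 32
```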
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]