
Add query memory limits #1747

Closed

Conversation


@ppanyukov ppanyukov commented Nov 14, 2019

This is a draft implementation of Query Memory Limits proposal (#1746).

Things to note:

  • Things are counted inside some quite substantial loops.
  • Ideally I'd like this feature to work on the basis of "you don't use it, you don't pay for it".
  • However, it might not be possible to remove the cost entirely.
  • The cost of counting is likely to be only a fraction of the overall cost of queries.

Steps I've taken to minimise any performance impact:

  • Using buffered counters, especially to avoid excessive calls to atomic.AddInt64; the cost of using the atomic package across multiple goroutines can be exceptionally high (see the sketch after this list).
  • Minimising function calls by inlining where possible and adding to raw local variables that are not shared between goroutines.
  • Approximating the query sizes where possible:
    • Querier: to avoid iterating over labels and chunks;
    • Store Gateway: to avoid excessive increments to counters, as the loops are quite substantial.
  • Overall I couldn't see any measurable negative impact in my single-machine environment.
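To make the buffered-counter point concrete, here is a minimal sketch of the idea (names and the 20MB threshold are mine for illustration, not the exact code in this PR): each goroutine adds to a plain local int64 and only folds the buffer into the shared counter via atomic.AddInt64 once it crosses the threshold.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Flush the local buffer into the shared counter only once it reaches ~20MB,
// so the hot loop does plain additions instead of atomic operations.
const flushThreshold = 20 * 1000 * 1000

var queryBytes int64 // shared, per-query counter

// countSizes simulates one goroutine accounting for the bytes it touches.
func countSizes(sizes []int64) {
	var local int64
	for _, s := range sizes {
		local += s // cheap: no atomics, no function calls inside the loop
		if local >= flushThreshold {
			atomic.AddInt64(&queryBytes, local)
			local = 0
		}
	}
	if local > 0 {
		atomic.AddInt64(&queryBytes, local) // flush the remainder
	}
}

func main() {
	sizes := make([]int64, 100000)
	for i := range sizes {
		sizes[i] = 512 // pretend each series/chunk contributes 512 bytes
	}

	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			countSizes(sizes)
		}()
	}
	wg.Wait()
	fmt.Println("total bytes counted:", atomic.LoadInt64(&queryBytes))
}

With limits of 500MB and up, being up to ~20MB late on the shared counter is an acceptable trade for keeping atomics off the hot path.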

If the above steps are deemed insufficient and we really want to count nothing when the feature is not used, we can extract the counting into a func and set it to a no-op when the feature is not enabled (see the sketch below). But which is more expensive: a few int64 additions, or a function call? :)
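For completeness, the "make it a no-op when disabled" alternative would look roughly like this (a sketch with my own names, not code from this PR):

package main

import (
	"fmt"
	"sync/atomic"
)

var queryBytes int64

// addQueryBytes is the counting hook called from the hot loops. By default it
// is a no-op, so with the feature disabled the only cost is an indirect call.
var addQueryBytes = func(int64) {}

// enableQueryLimits swaps the no-op for a real counter at startup.
func enableQueryLimits() {
	addQueryBytes = func(n int64) { atomic.AddInt64(&queryBytes, n) }
}

func main() {
	addQueryBytes(100) // feature disabled: nothing counted
	enableQueryLimits()
	addQueryBytes(100) // feature enabled: counted
	fmt.Println(atomic.LoadInt64(&queryBytes)) // prints 100
}

Whether that indirect call ends up cheaper than a couple of unconditional int64 additions is exactly the trade-off in question.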

Surfacing the feature to users:

  • Currently using the env vars THANOS_LIMIT_QUERY_PIPE and THANOS_LIMIT_QUERY_TOTAL, as this is the least intrusive option (a sketch of how these might be read follows below).
  • By default limits are not enforced, although the statistics are still provided via the debug logger.
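To make the env-var surface concrete, a limit could be read at startup along these lines (a sketch; the helper name and the "treat malformed values as no limit" behaviour are my assumptions, not necessarily what limit.go does):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseLimitFromEnv reads a byte limit from the named env var. A missing or
// malformed value is treated as 0, i.e. "limit not enforced". (Illustrative.)
func parseLimitFromEnv(name string) int64 {
	v := os.Getenv(name)
	if v == "" {
		return 0
	}
	parsedLimit, err := strconv.ParseInt(v, 10, 64)
	if err != nil || parsedLimit < 0 {
		return 0
	}
	return parsedLimit
}

func main() {
	pipeLimit := parseLimitFromEnv("THANOS_LIMIT_QUERY_PIPE")
	totalLimit := parseLimitFromEnv("THANOS_LIMIT_QUERY_TOTAL")
	fmt.Printf("pipe limit: %d bytes, total limit: %d bytes (0 means not enforced)\n", pipeLimit, totalLimit)
}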

To run with limits:

THANOS_LIMIT_QUERY_PIPE=1200000000 THANOS_LIMIT_QUERY_TOTAL=1200000000 ./thanos

This will show things like this:

// Store Gateway
level=debug ts=2019-11-21T14:35:47.267444Z caller=limit.go:29 THANOS_LIMIT_QUERY_PIPE=500.00MB THANOS_LIMIT_QUERY_TOTAL=1.20GB
level=debug ts=2019-11-21T14:35:47.703926Z caller=bucket.go:678 queryTotalSize=9.36MB queryLocalSize=9.36MB
level=debug ts=2019-11-21T14:35:47.711417Z caller=bucket.go:678 queryTotalSize=18.72MB queryLocalSize=9.36MB
level=debug ts=2019-11-21T14:35:47.745473Z caller=bucket.go:678 queryTotalSize=28.08MB queryLocalSize=9.36MB
level=debug ts=2019-11-21T14:35:47.751189Z caller=bucket.go:678 queryTotalSize=37.44MB queryLocalSize=9.36MB
level=debug ts=2019-11-21T14:35:48.626972Z caller=bucket.go:678 queryTotalSize=89.36MB queryLocalSize=31.92MB
level=debug ts=2019-11-21T14:35:48.643825Z caller=bucket.go:678 queryTotalSize=101.28MB queryLocalSize=31.92MB
level=debug ts=2019-11-21T14:35:52.967735Z caller=bucket.go:678 queryTotalSize=643.61MB queryLocalSize=182.32MB
level=debug ts=2019-11-21T14:35:53.123464Z caller=bucket.go:678 queryTotalSize=645.92MB queryLocalSize=182.32MB
level=debug ts=2019-11-21T14:35:53.214819Z caller=bucket.go:678 queryTotalSize=648.24MB queryLocalSize=182.32MB
// Querier
level=debug ts=2019-11-21T15:05:38.760654Z caller=limit.go:29 THANOS_LIMIT_QUERY_PIPE=1.20GB THANOS_LIMIT_QUERY_TOTAL=1.20GB
level=debug ts=2019-11-21T15:17:16.325742Z caller=proxy.go:390 msg="THANOS_LIMIT_QUERY_TOTAL limit 1.20GB violated (got 1.20GB)"
level=debug ts=2019-11-21T15:17:16.325791Z caller=proxy.go:376 queryTotalSize=1200.03MB queryLocalSize=940.02MB
level=debug ts=2019-11-21T15:17:16.362089Z caller=proxy.go:390 msg="THANOS_LIMIT_QUERY_TOTAL limit 1.20GB violated (got 1.22GB)"
level=debug ts=2019-11-21T15:17:16.362125Z caller=proxy.go:376 queryTotalSize=1220.03MB queryLocalSize=280.01MB
level=debug ts=2019-11-21T15:17:16.362143Z caller=proxy.go:191 queryTotalSize=1220.03MB
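For context, the "limit ... violated" lines above boil down to comparing the running counter against the configured limit, roughly like this (a sketch with my own names, not the PR's exact code):

package main

import "fmt"

// checkQueryLimit returns an error once the running total crosses the
// configured limit. A limit of 0 means "not enforced", in which case only the
// debug statistics above are emitted. (Illustrative sketch.)
func checkQueryLimit(limitBytes, totalBytes int64) error {
	if limitBytes > 0 && totalBytes > limitBytes {
		return fmt.Errorf("query total size limit %d bytes violated (got %d bytes)", limitBytes, totalBytes)
	}
	return nil
}

func main() {
	// Mirrors the violation in the Querier log: 1.20GB limit, 1220.03MB observed.
	if err := checkQueryLimit(1200000000, 1220030000); err != nil {
		fmt.Println("aborting query:", err)
	}
}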

Verification

The memory counters seem to be "good enough", at least if we trust pprof to tell us how many bytes we allocate :) Using instrumentation (which is now removed), we get these figures.

// Store Gateway - single block

WRITTEN HEAP DUMP TO /Users/philip/thanos/github.com/ppanyukov/thanos-oom/heap-sg-blockSeries-11-before.pb.gz
MEM STATS DIFF:   	sg-blockSeries 	sg-blockSeries - AFTER 	-> Delta
    HeapAlloc  : 	288.14M 	563.41M 		-> 275.27M
    HeapObjects: 	5.02M 		7.27M 			-> 2.26M

MEM PROF DIFF:    	sg-blockSeries 	sg-blockSeries - AFTER 	-> Delta
    InUseBytes  : 	233.15M 	425.31M 		-> 192.16M    <==|
    InUseObjects: 	847 		1.27K 			-> 426
    AllocBytes  : 	2.17G 		2.52G 			-> 350.76M
    AllocObjects: 	5.24K 		5.71K 			-> 464

WRITTEN HEAP DUMP TO /Users/philip/thanos/github.com/ppanyukov/thanos-oom/heap-sg-blockSeries-11-after.pb.gz
queryLocalSize: 180.25MB    <==|

// Store Gateway - query overall

MEM STATS DIFF:   	sg-Series 	sg-Series - AFTER 	-> Delta
    HeapAlloc  : 	275.09M 	1.13G 			-> 858.34M
    HeapObjects: 	4.90M 		12.70M 			-> 7.80M

MEM PROF DIFF:    	sg-Series 	sg-Series - AFTER 	-> Delta
    InUseBytes  : 	223.81M 	946.70M 		-> 722.89M    <==|
    InUseObjects: 	843 		2.63K 			-> 1.78K
    AllocBytes  : 	2.15G 		3.48G 			-> 1.33G
    AllocObjects: 	5.18K 		7.35K 			-> 2.16K
queryTotalSize: 642.51MB    <==|

// Querier - query overall

MEM STATS DIFF:   	q-Series 	q-Series - AFTER 	-> Delta
    HeapAlloc  : 	5.27M 		892.61M 		-> 887.34M
    HeapObjects: 	16.99K 		13.83M 			-> 13.81M

MEM PROF DIFF:    	q-Series 	q-Series - AFTER 	-> Delta
    InUseBytes  : 	4.24M 		654.24M 		-> 650.00M    <==|
    InUseObjects: 	12 		14.89K 			-> 14.88K
    AllocBytes  : 	7.19M 		1.62G 			-> 1.61G
    AllocObjects: 	29 		18.04K 			-> 18.01K

WRITTEN HEAP DUMP TO /Users/philip/thanos/github.com/ppanyukov/thanos-oom/heap-q-Series-1-after.pb.gz
queryTotalSize: 668.00MB    <==|

Note that Querier and SG broadly agree on the total query size.
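The instrumentation that produced these diffs has been removed from the branch, but a similar before/after measurement can be taken with the standard runtime and runtime/pprof packages; a rough sketch (file names are illustrative, and the removed instrumentation may have worked differently):

package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// snapshot records heap stats and writes a heap profile for later diffing.
// (Sketch only; the removed instrumentation in this PR may differ.)
func snapshot(label, path string) runtime.MemStats {
	runtime.GC() // make consecutive snapshots a little more comparable

	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	if f, err := os.Create(path); err == nil {
		_ = pprof.WriteHeapProfile(f)
		_ = f.Close()
	}
	fmt.Printf("%s: HeapAlloc=%d HeapObjects=%d (heap dump: %s)\n", label, m.HeapAlloc, m.HeapObjects, path)
	return m
}

func main() {
	before := snapshot("before", "heap-before.pb.gz")
	// ... run the Series() call under test here ...
	after := snapshot("after", "heap-after.pb.gz")

	fmt.Printf("HeapAlloc delta: %d bytes, HeapObjects delta: %d\n",
		int64(after.HeapAlloc)-int64(before.HeapAlloc),
		int64(after.HeapObjects)-int64(before.HeapObjects))
}

The two heap dumps can then be compared with pprof's -base option.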

Changelog

  • I added a CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Not done yet :)

Docker image

The latest build of this branch is on Docker Hub if anyone wants to give it a spin:

docker pull ppanyukov/thanos:qlimit

Signed-off-by: Philip Panyukov <[email protected]>
@ppanyukov ppanyukov force-pushed the feature/CDATA-1163-query-limits branch from b2ea7c5 to c960b7b Compare November 21, 2019 13:32
@ppanyukov ppanyukov marked this pull request as ready for review November 21, 2019 15:34
@ppanyukov ppanyukov changed the title DRAFT: Add query memory limits Add query memory limits Nov 21, 2019
@ppanyukov (Contributor Author)

I think this is pretty much ready. If people could give some love to this PR it would be great :) @bwplotka ?

@ppanyukov (Contributor Author)

I've added THANOS_LIMIT_PROMQL_MAX_SAMPLES env var which is passed on to the PromQL engine as the limit.

However, I'm not sure this does anything much, or at least I couldn't find any sensible value for it beyond existing query limits.

This pathological query still uses tons of memory, despite all limits:

count({__name__=~".+"}) by (__name__)
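For reference, a max-samples value ends up in the PromQL engine options, roughly like this (a sketch against the upstream promql package; the actual wiring in this PR may differ):

package main

import (
	"time"

	"github.com/prometheus/prometheus/promql"
)

// newEngine builds a PromQL engine with the given sample limit; the
// THANOS_LIMIT_PROMQL_MAX_SAMPLES value would feed maxSamples. (Sketch only.)
func newEngine(maxSamples int) *promql.Engine {
	return promql.NewEngine(promql.EngineOpts{
		MaxSamples: maxSamples, // cap on samples held in memory by one query
		Timeout:    2 * time.Minute,
	})
}

func main() {
	_ = newEngine(50000000)
}

As far as I understand, MaxSamples only bounds the samples held in memory during evaluation, not label data, which would explain why the label-heavy query above still eats memory.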

@GiedriusS (Member)

> I've added THANOS_LIMIT_PROMQL_MAX_SAMPLES env var which is passed on to the PromQL engine as the limit.
>
> However, I'm not sure this does anything much, or at least I couldn't find any sensible value for it beyond existing query limits.
>
> This pathological query still uses tons of memory, despite all limits:
>
> count({__name__=~".+"}) by (__name__)

Probably due to the reasons I have mentioned here: #1369 (comment).

@GiedriusS (Member) left a comment:

I'm not sure what others think of this. I understand what you are trying to do, but I'm not sure that I like it: the GC does not release memory immediately, so counting these sizes might still lead to OOM situations, and it feels more like papering over the problem than solving it at its core. But I understand that stopping the world on every Series() call, or from time to time, is also not the solution. We currently have per-sample limits, and probably the next logical step would be to add some kind of limit on the label sets that a time series might have. But that probably needs to be solved elegantly at the Prometheus level instead of here.

I'm not sure I like the implementation as it is right now, either. We probably do not want to introduce magical variables like this.

return parsedLimit
}

func byteCountToHuman(n int64) string {
@ppanyukov (Contributor Author)

Yes, I considered that, but I don't want to take an extra dependency for something that can trivially be done in 10 lines of code and is used in only one place. Would you agree?
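For the record, this is the sort of ten-line helper I mean (a sketch; the exact formatting in the PR may differ slightly):

package main

import "fmt"

// byteCountToHuman renders a byte count using decimal (SI) units, matching
// the MB/GB figures in the debug logs above. (Sketch only.)
func byteCountToHuman(n int64) string {
	const unit = 1000
	if n < unit {
		return fmt.Sprintf("%dB", n)
	}
	div, exp := int64(unit), 0
	for v := n / unit; v >= unit; v /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.2f%cB", float64(n)/float64(div), "kMGTPE"[exp])
}

func main() {
	fmt.Println(byteCountToHuman(1200000000)) // 1.20GB
}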


// buffer query sizes and process in chunks of 20MB or so
// to avoid hammering cache lines with atomic increments.
const querySizeBufferSize = 20 * 1000 * 1000
@GiedriusS (Member)

Any benchmarks to show that this is really optimal?

@ppanyukov (Contributor Author)

This is really just a "finger in the air" figure, an arbitrary number. The reasoning: we don't care if we go over the limit by +/-20MB, since we are talking about limits of 500MB+ in real scenarios.

I don't know what kind of benchmarks we could run to show this is "optimal". What would "optimal" even mean in this case? I'm open to suggestions if someone has better ideas.

// approximate the length of each label being about 20 chars, e.g. "k8s_app_metric0"
approxLabelLen = int64(10)

// approximate the size if chunks by having 120 bytes in each?
@GiedriusS (Member)

Suggested change
// approximate the size if chunks by having 120 bytes in each?
// approximate the size of chunks by having 120 bytes in each?


defer func() {
	totalSizeMsg := fmt.Sprintf("%.2fMB", float64(queryTotalSize)/float64(1000000))
	localSizeMsg := fmt.Sprintf("%.2fMB", float64(queryLocalSize)/float64(1000000))
@GiedriusS (Member)

We could probably reuse byteCountToHuman here.

@ppanyukov (Contributor Author)

Yes, but I don't want to expose byteCountToHuman, as it's really an internal package detail; I don't think it makes sense for the package to provide a ByteCountToHuman.

If this bothers anyone, maybe taking a dep on humanize does make sense, and then we can use it in both places. I don't know whether the benefits of the extra dep outweigh the cost, though.

Not too fussed about either approach if this is what blocks this PR :)

@ppanyukov (Contributor Author)

... so counting of these sizes might still lead to OOM situations

Yes, correct. There are other allocations (index, shared chunk pool) which are not controlled by these knobs.

... and it feels more like tapering over the situation instead of solving it at its core.

I see it more as a step in the right direction for having resource limits, and an experimental feature to see whether it actually helps in the real world. But maybe I am indeed solving this at the wrong level. What do you mean by "solving it at its core"? If you could elaborate a bit, that would be great!

We currently have per-samples limits and probably the next logical step would be to add some kind of limits in terms of label sets that a time series might have.

Errm, I'm not familiar with this! Is it the "chunk pool" thing you are referring to? If so, I think this is slightly different. Or did you mean something else entirely?

But that probably needs to be elegantly solved somehow at the Prometheus level instead of here.

I'm not sure where at the Prometheus level this would be solved, as the whole thing seems to be Thanos-specific?

I'm not sure I like the implementation of how it is right now as well.

What's not to like about this beautiful zero-abstractions, carefully-crafted implementation? :) But seriously, any concrete dislikes are very welcome; I want to reach a happy consensus on the way this is implemented.

We probably do not want to introduce any magical variables like this.

Yes. Which ones? The 20MB one? The approximate sizes? All of them? I agree! I will take on board any better ideas if someone has them.


I'm very happy to discuss and change things to make them better. It would be ideal if we could reach agreement on the following:

  • Are we happy with this kind of approach in general?
  • If so, are we happy to expose it as an experimental feature via env vars?
  • If so, are we happy to iterate on it as we see real-world usage?
  • If there are any "no" answers, what do we need to do to turn them into "yes" answers?

stale bot commented Jan 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 11, 2020