Reduce store memory usage by reworking block.indexCache #1471
We get burned by having a relatively high number of label values over time, and the current index cache implementation pins all of these into RAM, even though most are never used. We're up to nearly 40GB of index.cache.json files now, and it grows over time because of continually changing label values. Prometheus itself doesn't mind, because the number of unique values in any recent chunk of time is small enough to fit into RAM, but thanos attempts to hold multiple copies of every value in history. The ByteSlice interface in prometheus/tsdb would fit neatly onto a download-with-cache function, which would:
This leans on the kernel page cache to do most of the RAM juggling, and the thanos code can focus on managing disk usage. With that in place, I believe the index.cache.json files become unnecessary.
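As an illustration of that download-with-cache idea, here is a minimal sketch (my own, not code from the thread). The TSDB index reader only needs the two-method ByteSlice contract (Len and Range), so a caching implementation can serve reads from a local sparse copy of the index and fall back to object storage for ranges it has not fetched yet. The cachingByteSlice name, the cached bitmap lookup, and the fetchRange callback are hypothetical placeholders; a real version would also round requests to page boundaries and return mmap-ed slices instead of copying.

```go
package indexcache

import "os"

// cachingByteSlice is a hypothetical sketch of the "download with cache"
// idea: it satisfies the two-method ByteSlice contract used by the TSDB
// index reader (Len and Range) by serving reads from a local sparse copy
// of the index file and fetching missing ranges from object storage.
type cachingByteSlice struct {
	size       int                                    // total size of the remote index file
	local      *os.File                               // sparse local cache file, same layout as the remote index
	cached     func(start, end int) bool              // hypothetical page-bitmap lookup
	fetchRange func(start, end int64) ([]byte, error) // hypothetical wrapper around the bucket's GetRange
}

func (c *cachingByteSlice) Len() int { return c.size }

// Range returns index bytes [start, end), downloading them into the local
// cache file first when they are not present yet. Reads of the local file
// go through the kernel page cache, which does the RAM juggling for us.
func (c *cachingByteSlice) Range(start, end int) []byte {
	if !c.cached(start, end) {
		b, err := c.fetchRange(int64(start), int64(end))
		if err != nil {
			panic(err) // ByteSlice has no error return; a real implementation needs a better strategy
		}
		if _, err := c.local.WriteAt(b, int64(start)); err != nil {
			panic(err)
		}
	}
	buf := make([]byte, end-start)
	if _, err := c.local.ReadAt(buf, int64(start)); err != nil {
		panic(err)
	}
	return buf
}
```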
Awesome! I like this generally, as well as the attempt to reuse as much as possible (ByteSlice from TSDB) ❤️ Just to clarify: do you mean to directly hook into ByteSlice, and do the same for chunks, in the form of: This is great from the code complexity and readability standpoint; however, there are a couple of challenges we might need to investigate first. Before we touch those, just one thing. It is super confusing and we should change the naming ASAP, but there are two index caches in Thanos 😄 It looks like @miekg @asuffield are on the same page, but just for further readers of this issue, there are two: the per-block index.cache.json that the store gateway loads on startup (pkg/block), and the in-memory LRU cache of postings and series used while serving queries (pkg/store/cache).
Challenges
How do we know "which part" to fetch? With the precomputed index.cache.json we knew the bare minimum, so I guess you mean fetching either the whole index or ONLY the index.cache.json part (the essential part needed to even start a lookup). I don't think downloading the whole index in TSDB format makes sense, unless we query for everything in the block (we probably should block such queries). We could just fetch part of the index, but how do we store it outside of memory? I guess you mean here to define some separate format (flatbuffers? protobufs?) to maintain this as a local file and mmap it? That "maintaining" will be hard, as this file is no longer immutable.
Overall I really love the direction. I think the "fall back" logic of trying the already mmap-ed file first, then mmap-ing the file if present on local disk, then fetching things to mmap as the last resort makes sense. However, we might need to do things like bundling object storage requests together, etc., which we need to figure out. Let me know if I am missing something, and thanks for this issue 👍
Yes, that sounds about right. I've pondered some of those challenges, and I have some ideas, although I'm not certain about them:

I was thinking that the read-partitioning code could be moved into the ByteSlice cache. If we round up all read ranges to 4k page boundaries (so as not to annoy the kernel page cache) and combine reads across multiple concurrent queries, then we should be able to achieve fewer GetRange calls than are currently made. This is the part where some experimentation is needed: I'm not sure if we'd need something like the Nagle algorithm to hold back short reads, or if we can just construct a read-combining work queue, accept that the very first thing sent will be inefficient, and rely on network latency to delay the rest long enough for the requests to be combined. A manual cork/uncork call around the prefetch functions might also work.

Since index.cache.json is constructed from specific sections of the index file, I was thinking that we can probably get rid of it entirely if we add a little precaching logic on the index files. Whenever a new local copy of an index file needs to be started, we would immediately fetch the TOC, then use it to fetch the symbol, label index, label offset, and postings offset sections. With these sections guaranteed to always be cached, we can then let the tsdb code do its usual thing, and it should produce a small number of requests for the correct ranges of postings and series.

For the copy of this file on disk, I have an idea for a variant of the index format which is precisely compatible: all 4k pages before the TOC are byte-for-byte identical to the cached source file, padded with zeros to a 4k boundary. After this we insert a bitmap of which pages are present: the first bit in this section corresponds to the first 4k page, etc. 0 indicates that the file on disk is missing that page (which will be filled with zeros); 1 indicates that the page has been fetched. Then as we fetch byte ranges, we flip the relevant bits. The TOC is moved to be after this section. The tsdb format is flexible enough that the prometheus tsdb code will still consider this to be an equivalent file. We'd need to fsync before updating the bitmap, but that's easy enough.

I wouldn't bother attempting to invert this logic for cache eviction - just delete entire files that haven't been touched recently. We expect most queries to be run over "the last X days", so blocks will naturally age out of the cache.

I expect the same concepts would work for chunk files - that isn't a problem we're currently having, so I've not spent any time thinking about it, but it sounds reasonable.
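A rough sketch of the bookkeeping described above (my illustration, not code from the thread), assuming a fixed 4k page size and a page bitmap stored inside the same cache file at a known offset. The point it demonstrates is the ordering: the fetched data pages are fsync-ed before the bitmap bits that advertise them, so a crash can never leave a bit set for a page that was not durably written.

```go
package indexcache

import "os"

const pageSize = 4096

// writeFetchedRange stores a fetched byte range into the sparse local copy of
// the index, then marks the covered 4k pages as present in the bitmap section
// of the same file. bitmapOff is the (hypothetical) offset of that bitmap.
func writeFetchedRange(f *os.File, bitmapOff, start int64, data []byte) error {
	if _, err := f.WriteAt(data, start); err != nil {
		return err
	}
	// Data pages must be on disk before the bitmap claims they exist.
	if err := f.Sync(); err != nil {
		return err
	}
	first := start / pageSize
	last := (start + int64(len(data)) + pageSize - 1) / pageSize
	for p := first; p < last; p++ {
		var b [1]byte
		if _, err := f.ReadAt(b[:], bitmapOff+p/8); err != nil {
			return err
		}
		b[0] |= 1 << uint(p%8) // 1 = page fetched, 0 = hole (reads back as zeros)
		if _, err := f.WriteAt(b[:], bitmapOff+p/8); err != nil {
			return err
		}
	}
	return f.Sync()
}
```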
Sorry for a laggy response; also cc @krasi-georgiev @codesome @gouthamve, as you might find this thread interesting (:
While page rounding and
It's rarely a "small" number of requests, especially for things like inverted matching.
Hm, I think this will work from a TSDB reader's standpoint. However, maybe I am missing something, but by
Ah, I can see a way to construct an API to make the cork/uncork approach work. If we make sure the relevant cachingByteSlice struct is accessible by a bucketBlock, then we give it an extra couple of functions:

```go
func (cbs *cachingByteSlice) PrefetchRange(queryID string, start, end int) {...}
func (cbs *cachingByteSlice) RunPrefetches(queryID string) {...}
```

Then a function like PreloadSeries would call PrefetchRange on every interesting byte range, then call RunPrefetches once. The partitioning logic runs in there, and inserts suitably wide downloads into the queue. The tsdb code can then be allowed to run as usual, and all its calls to ByteSlice should combine into one of those prefetches and wait for them to complete. We'll still need some application-level code to assemble the prefetches, but that's mostly code which already exists.

For the on-disk file, there's an interesting detail worth noting about how Linux reacts to being used this way. If you create an empty file and call truncate() to make it 500GB in size, you get a sparse file - filled with zeros, and no disk blocks allocated. As you write to sections of the file, disk blocks get allocated to the parts which you use; for ext4 these are 4KB, so they align with pages. Reading from any of the unallocated sections will just return zeros. This does have the downside of potentially fragmenting the file, but that only really affects serial reads, which we don't do.

You can then mmap the whole file, which consumes only virtual memory space, since it's backed by the disk (holes in the file are backed by the zero page). Actual data will sit in the kernel page cache if it can.
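A small Linux-only demonstration of that kernel behaviour (my sketch, not from the thread; the file path and sizes are made up): truncate gives a sparse file, WriteAt allocates blocks only for the touched pages, and mmap-ing the whole file costs address space rather than RAM.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Create("/tmp/sparse-index-cache")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	const size = 1 << 30 // pretend the remote index is 1 GiB
	if err := f.Truncate(size); err != nil {
		panic(err) // sparse file: no disk blocks allocated yet
	}

	// Write one "fetched" page somewhere in the middle; only that page
	// (4 KiB on ext4) gets real disk blocks.
	if _, err := f.WriteAt([]byte("fetched bytes"), 512*4096); err != nil {
		panic(err)
	}

	// Map the whole file. This costs virtual address space, not RAM;
	// pages come from the kernel page cache, and holes read as zeros.
	data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Printf("%q\n", data[512*4096:512*4096+13]) // "fetched bytes"
	fmt.Println(data[0])                           // 0: a hole, backed by the zero page
}
```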
Awesome! I think generally I am happy with the ideas you have.
Yup, but we need to be careful there, as sometimes the native TSDB helpers will not be suitable 👍 Thanks for the clarification on the on-disk file and mmap - I was not fully aware of this. I think for Linux it might work 👍 Now the question is what happens if someone uses a totally different filesystem or ... NFS 😱 (that should be blocked/not allowed) or FreeBSD/Windows/Mac. I think it's fine to target Linux/Unix only, but we need to be explicit about it then. Any thoughts on other OSes/filesystems?
FreeBSD has sparse file support in UFS and ZFS; macOS has it in APFS (although that's quite new). Windows doesn't have anything native, but I suspect the performance of something like thanos is never going to be great there. Probably the best bet on Windows is to run it under WSL2. Using NFS for a disk cache is a bad idea unless crafted by somebody who really understands NFS (and even then, it's probably still a bad idea). The goal is to store the files on local, efficient storage; to do something sensible without local storage would take a different design - which I have some ideas for, but that's an unrelated problem space. It would sound more like "design a caching s3-like interface backed by a distant/expensive object store", which is something that will never need implementing in thanos (it would make sense as a standalone project and be useful to thanos and many other tools).
👍 All looks good to me then (: Thanks.
@asuffield do you plan to work on/implement this? If yes, how would you see the implementation process, e.g. in terms of migration and iterating? Should we aim for a one-off switch, or is there a way to deliver this gradually? I think we can just switch to the new one in a certain release, but that might mean a feature branch etc. - thoughts? Feel free to join our CNCF Slack for some more sync discussion. (:
We'll have a hard time allocating someone to help fix this. In the interim we can work with the new
Cool, marking as "Looking for brave souls" for that initiative. I can definitely help with review and design (:
Prioritizing selectors as well. Some initial discussion is here: #1245 (comment)
@bwplotka I am interested in helping with this (:
Awesome! Just to be aware: it's quite a major effort and requires lots of time, but we might be able to split the work a bit. Are you on CNCF Slack? It would be easier to sync. @zhulongcheng
@zhulongcheng any update? Are you able to join us on Slack?
Yes. I am writing a proposal doc based on this issue; I will open a new PR once it is completed.
Work Plan:
Nice, thanks for this @zhulongcheng 👍

First of all, can you @zhulongcheng start the proposal as a markdown page, like we did for other bigger initiatives: https://github.com/thanos-io/thanos/tree/master/docs/proposals? It will be much easier to comment on a PR with the proposal in markdown (: What do you think? See this section from the CONTRIBUTING doc for reference.

Secondly, can we order this diagram somehow for better clarity? I assume it shows the sequence of actions done for each query against some block? Can we maybe make it rather in the form of a sequence diagram or pseudo code, or at least show how the actions look exactly for a query (StoreAPIs). For example:
Now this will give us some way to determine gaps. For example, how and when we ensure:
Thirdly, it's not very clear to me why this is needed:
Be aware that changing this is tricky, as we now support multiple providers and we would need to add this for every one of them. Avoiding this would be awesome (: If we can't avoid it then we have no other option, but I am not sure yet if it's needed. Can you clarify? (:

Last but not least, some comments for
How does this combine multiple obj store requests into one? Or split big ones into many, as we do right now here? I guess something like what was proposed here: #1471 (comment) would be nice.
Depends on when one would call this. I guess we might add separate explicit methods for:

Also, input from @asuffield would be awesome, as we are starting to discuss design details. (:
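On the question above about combining several object store requests into one (and splitting overly large ones), a hedged sketch of the usual approach: round every requested range to 4k pages, sort, then merge neighbours whose gap is small while capping the size of any single GetRange call. The names and thresholds below are illustrative only and are not Thanos's actual partitioning code.

```go
package fetchplan

import "sort"

type byteRange struct{ start, end int64 } // half-open [start, end)

const (
	pageSize = 4096
	maxGap   = 512 << 10 // merge ranges separated by less than 512 KiB
	maxFetch = 16 << 20  // don't merge beyond 16 MiB per request
)

// planFetches turns many small, possibly overlapping read ranges into a
// small number of GetRange-sized requests.
func planFetches(ranges []byteRange) []byteRange {
	if len(ranges) == 0 {
		return nil
	}
	// Round to page boundaries so we never annoy the kernel page cache.
	rounded := make([]byteRange, len(ranges))
	for i, r := range ranges {
		rounded[i] = byteRange{
			start: r.start / pageSize * pageSize,
			end:   (r.end + pageSize - 1) / pageSize * pageSize,
		}
	}
	sort.Slice(rounded, func(i, j int) bool { return rounded[i].start < rounded[j].start })

	out := []byteRange{rounded[0]}
	for _, r := range rounded[1:] {
		last := &out[len(out)-1]
		// Merge if the gap is small enough and the combined request stays small.
		if r.start-last.end <= maxGap && r.end-last.start <= maxFetch {
			if r.end > last.end {
				last.end = r.end
			}
			continue
		}
		out = append(out, r)
	}
	return out
}
```

Each resulting range would map to one GetRange call; prefetches from several concurrent queries could be fed into the same planner to get the cross-query read combining discussed earlier in the thread.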
Will move to a markdown page and add more details.
Just a heads up: Improbable will be looking to hire a contractor to specifically help with this task (hopefully in collaboration with others in the community) in October, as it is something we are hitting ourselves. I will loop him in on this discussion and the design doc (@zhulongcheng please link it once the PR is open).
Greetings all, I am that contractor @domgreen mentioned. I'm here to work specifically on this issue, along with all other helping hands :) I am on CNCF Slack for any insta-chats and discussions: https://slack.cncf.io/ @zhulongcheng, thanks for the detailed design proposal, looks very comprehensive! I will look at it in detail ASAP. Let's connect on CNCF Slack? In the meantime, the broad plan is:
For the sake of repro, I can fill in some details, as we've now root-caused the original problem that spawned this effort. This is all about label value cardinality, and in particular a dimension which prometheus itself is less sensitive to: the cardinality of all stored values over time (prometheus only cares about the cardinality of values covered by a query, not the amount stored). Due to compaction cycles, the index.cache.json files in compacted blocks end up accumulating multiple lists of every value ever. Our specific culprit turned out to be kube-state-metrics, and the pod metrics in particular. On any reasonably sized cluster this creates many new label values for every container created, and if these get stored into thanos then the amount of memory required for it to start inflates rapidly. A few months of history on a decent-sized cluster worked out to 40GB of label values.
@asuffield, this is great! Do you want to connect on CNCF Slack https://slack.cncf.io/ to discuss?
He's at SREcon for the rest of the week, but what we've dumped here is the gist - there isn't a whole lot of extra context to give.
@bwplotka and I had a good discussion on this today and we came up with a test/repro plan which we think should help us move forward on this issue. /cc: @domgreen The repro/test plan:

We will use this to repro the OOMs and other issues, and to measure the impact of any proposed fixes.
I'm not following anything discussed on Slack, but I'm of course curious about any updates and/or progress made here. Is there anything to share? Thanks!
Yup, same here. I've been watching these memory issues for a while, and it seems like nobody really had a good explanation until I read this. Lo and behold, it seems like Improbable is finally experiencing the issues first-hand and has finally decided that this mmap model is pretty bad (about time).
See the umbrella bug on Store memory usage: #448
Lots of memory is being used to keep the indexCache in memory (this is in pkg/block/index.go).
Instead of having a fixed in-RAM cache of every index file that is loaded, we could maintain an LRU cache of partial index files on disk. Index files are designed to be mmap-ed, so we can leverage the kernel page cache for RAM caching of this data. This would replace the existing LRU index caches, and let us delegate more of the work to the prometheus tsdb libraries, which already support this operating mode.
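To make the "LRU cache of partial index files on disk" idea concrete, here is a hedged sketch (my illustration, not existing Thanos code) of the simplest eviction policy discussed later in this thread: keep one cache file per block and, when the cache directory exceeds its budget, delete whole files that have not been touched for the longest time. A real implementation would look at allocated blocks rather than the logical size of sparse files, and at access time rather than modification time.

```go
package diskcache

import (
	"os"
	"path/filepath"
	"sort"
)

// evictLRU deletes whole per-block cache files, oldest-modified first, until
// the cache directory is back under maxBytes. Partial eviction inside a file
// is deliberately not attempted; blocks age out as whole units.
func evictLRU(dir string, maxBytes int64) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	type cacheFile struct {
		path string
		size int64 // note: logical size, which overstates sparse files
		mod  int64
	}
	var files []cacheFile
	var total int64
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			return err
		}
		files = append(files, cacheFile{filepath.Join(dir, e.Name()), info.Size(), info.ModTime().UnixNano()})
		total += info.Size()
	}
	sort.Slice(files, func(i, j int) bool { return files[i].mod < files[j].mod })
	for _, f := range files {
		if total <= maxBytes {
			break
		}
		if err := os.Remove(f.path); err != nil {
			return err
		}
		total -= f.size
	}
	return nil
}
```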
/cc @asuffield (to flesh this out some more, as he has spent more cycles on this recently).