
Reduce store memory usage by reworking block.indexCache #1471

Closed
miekg opened this issue Aug 28, 2019 · 30 comments · Fixed by #1952

Comments

@miekg
Contributor

miekg commented Aug 28, 2019

See the umbrella bug on Store memory usage: #448

Lots of memory is being used to keep the indexCache in memory (this is in pkg/block/index.go).
Instead of having a fixed in-RAM cache of every index file that is loaded, we could maintain an LRU cache of partial index files on disk. Index files are designed to be mmapped, so we can leverage the kernel page cache for RAM caching of this data. This would replace the existing LRU index caches and let us delegate more of the work to the Prometheus TSDB libraries, which already support this operating mode.

/cc @asuffield (to flesh this out some more, as he has spent more cycles on this recently).

@asuffield

We get burned by having a relatively high number of label values over time, and the current index cache implementation pins all of these into ram, even though most are never used. We're up to nearly 40Gb of index.cache.json files now, and it grows over time because of continually changing label values. Prometheus itself doesn't mind because the number of unique values in any recent chunk of time is small enough to fit into ram, but thanos attempts to hold multiple copies of every value in history.

The ByteSlice interface in prometheus/tsdb would fit neatly onto a download-with-cache function, which would:

  • mmap a disk file for local backing store
  • if the index file is not already present, fetch part or all of it
  • return a byte slice into the mmapped file

This leans on the kernel page cache to do most of the ram juggling, and the thanos code can focus on managing disk usage. With that in place, I believe the index.cache.json files become unnecessary.
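For reference, the ByteSlice interface in prometheus/tsdb is tiny, so a download-with-cache implementation only needs to satisfy two methods. A rough sketch of the idea (cachedByteSlice and its fetch helper are hypothetical names, not existing code):

// ByteSlice as defined by prometheus/tsdb's index package.
type ByteSlice interface {
	Len() int
	Range(start, end int) []byte
}

// cachedByteSlice is a hypothetical implementation backed by an mmapped local
// file; ranges that are not on disk yet are fetched from object storage first.
type cachedByteSlice struct {
	data  []byte                     // mmapped local backing file
	size  int                        // size of the remote index file
	fetch func(start, end int) error // fills data[start:end] from object storage
}

func (b *cachedByteSlice) Len() int { return b.size }

func (b *cachedByteSlice) Range(start, end int) []byte {
	// Fetch part (or all) of the index file if this range is not present yet.
	// ByteSlice has no error return, so real code needs another way to surface failures.
	if err := b.fetch(start, end); err != nil {
		return nil
	}
	return b.data[start:end]
}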

@bwplotka
Member

bwplotka commented Aug 28, 2019

Awesome! I like this generally, as well as the attempt to reuse as much as possible (ByteSlice from TSDB) ❤️

Just to clarify:

Do you mean to hook directly into ByteSlice (and do the same for chunks), in the form of: ByteSlice -> Cache (mmap) -> Cache2 (Disk) -> Object Storage API GetRange/Get?

This is great from a code complexity and readability standpoint; however, there are a couple of challenges we might need to investigate first.

Before we touch those, just one thing. It is super confusing and we should change the naming ASAP, but there are two index caches in Thanos 😄 It looks like @miekg and @asuffield are on the same page, but for further readers of this issue, there are:

  • index.cache.json - this should probably be called the index entry point instead. It is a data model that gives us the bare minimum needed to look up/navigate through the index.
  • indexCache - an in-memory cache (LRU map) that stores postings and series for each block. It gets filled at query time.

Challenges

  1. Most object storages do not like many small requests for tiny byte ranges. The API is heavily rate-limited and super slow in high-traffic cases; that's why we do optimizations like partitioning at query time:

    // ExpandedPostings returns postings in expanded list instead of index.Postings.

    That's why a trivial ByteSlice cache, where on a miss for a certain slice (byte range) we just fetch it from the blob, will potentially not work efficiently.

if the index file is not already present, fetch part or all of it

How do we know "which part" to fetch? With the precomputed index.cache.json we knew the bare minimum, so I guess you mean fetching either the whole index or ONLY index.cache.json (the essential part needed to even start a lookup). I don't think downloading the whole index in TSDB format makes sense, unless we query for everything in the block (we probably should block such queries). We could fetch just part of the index, but how do we store it outside of memory? I guess you mean defining some separate format (flatbuffers? protobufs?) to maintain this as a local file and mmap it? That "maintaining" will be hard, as this file is no longer immutable.

  2. I think exactly the same applies to chunks/chunk files, although you mention the index only.

  3. We might want to add tombstone support at some point for some sort of sidecar: Handle or do not allow delete_series for Thanos with object store backup #415. Nothing is designed yet, but it's worth keeping this in mind while redesigning this part.

Overall I really love the direction. I think the "fall back" logic of trying the mmapped file first, then mmapping the file if present on local disk, then fetching things to mmap as the last resort makes sense. However, we might need to do things like bundling object storage requests together, etc., which we still need to figure out.

Let me know if I am missing something and thanks for this issue 👍

@asuffield

Yes, that sounds about right.

I've pondered some of those challenges, and I have some ideas, although I'm not certain about them:

I was thinking that the read-partitioning code could be moved into the ByteSlice cache. If we round up all read ranges to 4k page boundaries (so as not to annoy the kernel page cache) and combine reads across multiple concurrent queries then we should be able to achieve fewer GetRange calls than are currently made. This is the part where some experimentation is needed: I'm not sure if we'd need something like the Nagle algorithm to hold back short reads, or if we can just construct a read-combining work queue, accept that the very first thing sent will be inefficient, and rely on network latency to delay the rest long enough for the requests to be combined. A manual cork/uncork call around the prefetch functions might also work.
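A minimal sketch of that rounding-and-combining step (all names here are made up for illustration; the input is assumed to be sorted by start offset):

const pageSize = 4096

type byteRange struct{ start, end int }

// roundToPages expands a read to 4k page boundaries so repeated small reads
// land on the same kernel page cache entries.
func roundToPages(r byteRange) byteRange {
	r.start = (r.start / pageSize) * pageSize
	r.end = ((r.end + pageSize - 1) / pageSize) * pageSize
	return r
}

// combineRanges merges overlapping or touching page-aligned ranges, so several
// concurrent queries turn into fewer, wider GetRange calls.
func combineRanges(sorted []byteRange) []byteRange {
	var out []byteRange
	for _, r := range sorted {
		r = roundToPages(r)
		if n := len(out); n > 0 && r.start <= out[n-1].end {
			if r.end > out[n-1].end {
				out[n-1].end = r.end
			}
			continue
		}
		out = append(out, r)
	}
	return out
}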

Since index.cache.json is constructed from specific sections of the index file, I was thinking that we can probably get rid of it entirely if we add a little precaching logic on the index files. Whenever a new local copy of an index file needs to be started, we would immediately fetch the TOC, then use it to fetch the symbol, label index, label offset, and postings offset sections. With these sections guaranteed to always be cached, we can then let the tsdb code do its usual thing and it should produce a small number of requests for the correct ranges of postings and series. For the copy of this file on disk, I have an idea for a variant of the index format which is precisely compatible:

All 4k pages before the TOC are byte-for-byte identical to the cached source file, padded with zeros to a 4k boundary. After this we insert a bitmap of which pages are present: the first bit in this section corresponds to the first 4k page, etc. 0 indicates that the file on disk is missing that page (which will be filled with zeros), 1 indicates that the page has been fetched. Then, as we fetch byte ranges, we flip the relevant bits. The TOC is moved to be after this section. The tsdb format is flexible enough that the prometheus tsdb code will still consider this to be an equivalent file. We'd need to fsync before updating the bitmap, but that's easy enough. I wouldn't bother attempting to invert this logic for cache eviction - just delete entire files that haven't been touched recently; we expect most queries to be run over "the last X days", so blocks will naturally age out of the cache.
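A rough sketch of the page bookkeeping this implies (a simple byte-slice bitmap; names are illustrative only):

const pageSize = 4096

// pageBitmap records which 4k pages of the sparse on-disk copy have been fetched.
type pageBitmap []byte

func newPageBitmap(fileSize int) pageBitmap {
	pages := (fileSize + pageSize - 1) / pageSize
	return make(pageBitmap, (pages+7)/8)
}

func (b pageBitmap) has(page int) bool { return b[page/8]&(1<<uint(page%8)) != 0 }
func (b pageBitmap) set(page int)      { b[page/8] |= 1 << uint(page%8) }

// missingPages lists the pages in [start, end) that still need fetching.
// Fetched data must be fsynced before the corresponding bits are flipped,
// so a crash can never mark unfetched pages as present.
func (b pageBitmap) missingPages(start, end int) []int {
	var missing []int
	for p := start / pageSize; p*pageSize < end; p++ {
		if !b.has(p) {
			missing = append(missing, p)
		}
	}
	return missing
}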

I expect the same concepts would work for chunk files - that isn't a problem we're currently having so I've not spent any time thinking about it, but it sounds reasonable.

@bwplotka
Member

Sorry for the laggy response. Also cc @krasi-georgiev @codesome @gouthamve, as you might find this thread interesting (:

read-partitioning code could be moved into the ByteSlice cache. If we round up all read ranges to 4k page boundaries (...) and combine reads across multiple concurrent queries

While page rounding and combining reads across multiple concurrent queries are nice to have, IMO we really want to "bundle" GetRange calls from a single StoreAPI.Series request as well. I think the key misunderstanding might be here:

we can then let the tsdb code do its usual thing and it should produce a small number of requests for the correct ranges of postings and series

It's rarely a "small" number of requests. Especially for things like inverted matching (job!="X") we fetch a lot of postings. Even if we round up to 4KB, we would fetch GBs from the object store to disk in thousands of small 4KB requests. We can easily be rate-limited and slow in this case. The cork/uncork logic makes sense, but the problem is that a cache inside ByteSlice doesn't have enough context to delay the actual fetch. It either fetches bytes or it doesn't (: If we want something like a LazyByteSlice (we had something similar a few months ago with lazyPostings), it is quite fragile and hard to debug. Also, that's why we needed to rewrite usage of this into this. This partitioning was possible only at the application level, in layers above ByteSlice.

All 4k pages before the TOC are byte-for-byte identical to the cached source file, padded with zeros to a 4k boundary. After this we insert a bitmap of which pages are present: the first bit in this section corresponds to the first 4k page, etc. 0 indicates that the file on disk is missing that page (which will be filled with zeros),

Hm, I think this will work from the TSDB reader's standpoint. However, maybe I am missing something, but by "0 indicates that the file on disk is missing that page (which will be filled with zeros)" you mean that a missing page will just be an empty 4KB of allocated space filled with zeroes, right? Then in order to grab a single posting for a single series, one request will fetch 4KB for the posting and 4KB for the series from object storage (+ the initial entry point / content from the previous index.cache.json), which is nice, but won't it use XX GBs of disk space with most of it being zeroes? 🤔 (as the index file can be heavy). Also, what would you mmap in this case? The whole XX GB file as well? I might be missing some detail here.

@asuffield

Ah, I can see a way to construct an API to make the cork/uncork approach work. If we make sure the relevant cachingByteSlice struct is accessible by a bucketBlock, then we give it an extra couple of functions:

func (cbs *cachingByteSlice) PrefetchRange(queryID string, start, end int) {...}
func (cbs *cachingByteSlice) RunPrefetches(queryID string) {...}

Then a function like PreloadSeries would call PrefetchRange on every interesting byte range, then call RunPrefetches once. The partitioning logic runs in there and inserts suitably wide downloads into the queue. The tsdb code can then be allowed to run as usual, and all its calls to ByteSlice should combine into one of those prefetches and wait for them to complete.

We'll still need some application-level code to assemble the prefetches, but that's mostly code which already exists.
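In other words, something along these lines, using only the hypothetical API above (seriesRange is an illustrative placeholder):

type seriesRange struct{ start, end int }

// preloadSeries queues all interesting byte ranges first, then triggers a single
// combined prefetch; later ByteSlice.Range calls from tsdb only hit fetched pages.
func preloadSeries(cbs *cachingByteSlice, queryID string, refs []seriesRange) {
	for _, r := range refs {
		cbs.PrefetchRange(queryID, r.start, r.end)
	}
	// The partitioning logic runs here and inserts suitably wide downloads into the queue.
	cbs.RunPrefetches(queryID)
}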

For the on-disk file, there's an interesting detail worth noting about how linux reacts to being used this way. If you create an empty file and call truncate() to make it 500Gb in size, you get a sparse file - filled with zeros, and no disk blocks allocated. As you write to sections of the file, disk blocks get allocated to the parts which you use; for ext4 these are 4kb to align with pages. Reading from any of the unallocated sections will just return zeros. This does have the downside of potentially fragmenting the file, but that only really affects serial reads which we don't do.

You can then mmap the whole file, which consumes only virtual memory space, since it's backed by the disk (holes in the file are backed by the zero page). Actual data will sit in the kernel page cache if it can.
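A minimal sketch of that setup on Linux, using the standard os and syscall packages (error handling trimmed; openSparseCache is a made-up name):

import (
	"os"
	"syscall"
)

// openSparseCache creates (or reopens) a sparse local file the size of the full
// remote index and mmaps it. Unwritten regions read back as zeros and allocate
// no disk blocks until data is actually written.
func openSparseCache(path string, size int64) ([]byte, *os.File, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		return nil, nil, err
	}
	// Truncating to the full size produces a sparse file on ext4/xfs.
	if err := f.Truncate(size); err != nil {
		f.Close()
		return nil, nil, err
	}
	// Mapping the whole file consumes only virtual address space; holes are
	// backed by the zero page and real data sits in the kernel page cache.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(size), syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		f.Close()
		return nil, nil, err
	}
	return data, f, nil
}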

@bwplotka
Member

bwplotka commented Aug 29, 2019

Awesome! I think generally I am happy with the ideas you have.

We'll still need some application-level code to assemble the prefetches, but that's mostly code which already exists.

Yup, but we need to be careful there, as sometimes the native TSDB helpers will not be suitable 👍

Thanks for the clarification on the on-disk file and mmap - I was not fully aware of this. I think it might work for Linux then 👍

Now the question is: what if someone uses a totally different filesystem, or ... NFS 😱 (that should be blocked/not allowed), or FreeBSD/Windows/Mac?

I think it's fine to target Linux/Unix only, but then we need to be explicit about it. Any thoughts on other OSes/filesystems?

@asuffield

FreeBSD has sparse file support in UFS and ZFS; macOS has it in APFS (although that's quite new). Windows doesn't have anything native, but I suspect the performance of something like Thanos is never going to be great there. Probably the best bet on Windows is to run it under WSL2.

Using NFS for a disk cache is a bad idea unless crafted by somebody who really understands NFS (and even then, it's probably still a bad idea). The goal is to store the files on local efficient storage; to do something sensible without local storage would take a different design - which I have some ideas for, but that's an unrelated problem space. It would sound more like "design a caching s3-like interface backed by a distant/expensive object store", which is something that will never need implementing in thanos (it would make sense as a standalone project and be useful to thanos and many other tools).

@bwplotka
Member

👍

All looks good to me then (: Thanks.

@bwplotka
Member

bwplotka commented Sep 4, 2019

@asuffield do you plan to work on this/implement this?

If yes, how do you see the implementation process, e.g. in terms of migration and iteration? Should we aim for a one-off switch, or is there a way to deliver this gradually? I think we can just switch to the new implementation in a certain release, but that might mean a feature branch, etc. Thoughts?

Feel free to join our CNCF slack for some more sync discussion. (:

@miekg
Contributor Author

miekg commented Sep 4, 2019

We'll have a hard time allocating someone to help fix this. In the interim we can work with the new --selector option in store to buy us some time. TL;DR: don't hold your breath waiting for us to come up with a PR.
(this doesn't negate the fact that we very much want to see this fixed/implemented)

@bwplotka
Member

bwplotka commented Sep 4, 2019

Cool, marking as help-wanted. Happy to grab it at some point, but I'm not sure when either.

Looking for brave souls for this initiative; I can definitely help with review and design (:

@bwplotka
Member

bwplotka commented Sep 4, 2019

Prioritizing selectors as well. Some initial discussion is here: #1245 (comment)

@zhulongcheng
Contributor

@bwplotka I am interested in helping with this (:

@bwplotka
Member

bwplotka commented Sep 6, 2019

Awesome! Just to be aware: it's quite a major effort and requires lots of time, but we might be able to split the work a bit. Are you on CNCF Slack? It would be easier to sync. @zhulongcheng

@bwplotka
Member

@zhulongcheng any update? Are you able to join us on Slack?

@zhulongcheng
Contributor

Are you able to join us on Slack?

Yes.

I am writing a proposal doc based on this issue. I will open a PR once it's completed.

@zhulongcheng
Contributor

zhulongcheng commented Sep 16, 2019

Work Plan:

  • A) Get file size from object storage
    Add a GetMeta method to the objstore.BucketReader interface (see the sketch after this list).
    Ensure all object storage clients implement the GetMeta method.

  • B) Implement BucketFileByteSlice (like the cachingByteSlice above)

  • C) Use BucketFileByteSlice for BucketIndexReader and BucketChunkReader
    bucketBlock.Preload -> … -> BucketFileByteSlice.Prefetch
    bucketBlock decode chunk/series -> … -> BucketFileByteSlice.Range

  • D) Remove the LRU index cache and chunkPool from bucketBlock.
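One possible shape for the addition in step A (the GetMeta name comes from the plan above; the ObjectMeta struct and its fields are just an assumption for illustration, not an existing Thanos API):

import "context"

// ObjectMeta would carry object metadata; the size is what BucketFileByteSlice
// needs up front to create its sparse local file and pages bitmap.
type ObjectMeta struct {
	Size int64
}

// Proposed extension to objstore.BucketReader (sketch only), typically backed by
// a cheap HEAD-style request in each provider client.
type MetaGetter interface {
	GetMeta(ctx context.Context, name string) (ObjectMeta, error)
}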

@zhulongcheng
Contributor

BucketFileByteSlice Architecture Overview:
(architecture diagram attached as an image in the original comment)

@zhulongcheng
Contributor

BucketFileByteSlice looks like:

import (
	"os"
	"sync"

	"github.com/RoaringBitmap/roaring"
	"github.com/thanos-io/thanos/pkg/objstore"
)

// BucketFileByteSlice implements the ByteSlice interface of the Prometheus TSDB.
// Callers can use BucketFileByteSlice just like ByteSlice.
// Typically, it represents a block index file or
// chunks file in remote object storage.
type BucketFileByteSlice struct {
	mtx   sync.RWMutex
	file  *os.File        // file descriptor
	data  []byte          // mmapped bytes, shared & read only
	pages *roaring.Bitmap // records which pages have been fetched
	size  int             // file size

	bkt  objstore.BucketReader
	name string // bucket object name
	path string // local file path
}

// Range returns a subset of f.data.
// If some pages in the range have not been fetched,
// it fetches those pages from object storage.
func (f *BucketFileByteSlice) Range(start, end int) []byte {...}

// Len returns the file size.
func (f *BucketFileByteSlice) Len() int {...}

// Prefetch fetches bytes from object storage and writes them to the local file.
// It also updates the pages bitmap if the bytes were written successfully.
func (f *BucketFileByteSlice) Prefetch(start, end int) error {...}

// (TODO: should it remove the local file?)
// Close resets the pages bitmap, closes the file descriptor, and unmaps the local file.
func (f *BucketFileByteSlice) Close() error {...}
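A rough sketch of how Range could tie the pages bitmap and Prefetch together (pagesFetched is a hypothetical helper; this is not final code):

// Range is called by the tsdb reader; on a miss it falls back to Prefetch
// before returning a slice of the shared mmapped data.
func (f *BucketFileByteSlice) Range(start, end int) []byte {
	f.mtx.RLock()
	ok := f.pagesFetched(start, end) // hypothetical helper checking f.pages
	f.mtx.RUnlock()

	if !ok {
		// Slow path: fetch the missing pages, write them to the local file
		// and flip their bits in f.pages.
		if err := f.Prefetch(start, end); err != nil {
			// ByteSlice has no error return, so errors must be surfaced out of band.
			return nil
		}
	}
	return f.data[start:end]
}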

@bwplotka
Member

bwplotka commented Sep 16, 2019

Nice, thanks for this @zhulongcheng 👍

First of all, @zhulongcheng, can you start a proposal as a markdown page, like we did for other bigger initiatives: https://github.com/thanos-io/thanos/tree/master/docs/proposals? It will be much easier to comment on a PR with the proposal in markdown (: What do you think? See this section of the CONTRIBUTING doc for reference.

Secondly, can we order this diagram somehow for better clarity? I assume it shows the sequence of actions done for each query against some block? Can we maybe make it a sequence diagram or pseudo code instead, or at least show exactly what the actions look like for a query (a StoreAPI Series call)?

For example:

  • Store Gateway Series call starts
  • Match blocks (based on ext labels and time range) we know about thanks to periodic syncing:
    • Syncing stores only meta.json
  • For each matched block:
    • If there is NO index descriptor yet, fetch the TOC, symbols and postings refs portions of the index onto disk, then mmap.
    • Figure out which postings to fetch (make sure to combine multiple object store requests into one)
    • Fetch postings into our sparse index file, make sure it is all mmapped
    • Determine which series to fetch (again, in a combined manner)
    • Fetch series refs into our sparse index file, make sure it is all mmapped
    • ... do similar for chunks, etc.

Now this will give us some way to determine gaps. For example, how and when we decide:

  • What to remove from memory (remove mmaps for chunks, or the whole index for blocks that are no longer used, based on the **memory limits** we set)
  • What to remove from disk in the same manner

Thirdly, it's not very clear to me why this is needed:

A) Get file size from object storage
Add a GetMeta method to the objstore.BucketReader interface.
Ensure all object storage clients implement GetMeta method.

Be aware that changing this is tricky, as we now support multiple providers and we would need to add this for every one of them. Avoiding this would be awesome (: If we can't avoid it then we have no other option, but I'm not sure yet if it's needed. Can you clarify? (:

Last but not least, some comments on BucketFileByteSlice:

// Prefetch fetches bytes from object storage and writes them to the local file.
// It also updates the pages bitmap if the bytes were written successfully.
func (f *BucketFileByteSlice) Prefetch(start, end int) error {...}

How does this combine multiple object store requests into one? Or split big ones into many, as we do right now here? I guess something like what was proposed here: #1471 (comment) would be nice.
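For illustration, the existing partitioning roughly does two things that Prefetch would need to replicate - bundling nearby ranges and splitting oversized ones (the names and thresholds below are illustrative, not the actual Thanos code):

type byteRange struct{ start, end int }

// partitionRanges bundles nearby ranges into one GetRange call when the gap
// between them is small, and starts a new request once a bundle would exceed maxSize.
// Input is assumed sorted by start offset.
func partitionRanges(sorted []byteRange, maxGap, maxSize int) []byteRange {
	var out []byteRange
	for _, r := range sorted {
		if n := len(out); n > 0 &&
			r.start-out[n-1].end <= maxGap &&
			r.end-out[n-1].start <= maxSize {
			if r.end > out[n-1].end {
				out[n-1].end = r.end
			}
			continue
		}
		out = append(out, r)
	}
	return out
}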

// (TODO: should it remove the local file?)
// Close resets the pages bitmap, closes the file descriptor, and unmaps the local file.
func (f *BucketFileByteSlice) Close() error {...}

It depends on when one would call this. I guess we might add separate explicit methods like RemoveFromDisk, CloseMmap, etc.

Also, input from @asuffield would be awesome as we are starting to discuss design details. (:

@zhulongcheng
Contributor

I will move this to a markdown page and add more details.

@domgreen
Contributor

Just a heads up: Improbable will be looking to hire a contractor in October specifically to help with this task (hopefully in collaboration with others in the community), as it is something we are hitting ourselves. I will loop him in on this discussion and the design doc (@zhulongcheng please link it once the PR is open).

@ppanyukov
Contributor

Greetings all, I am the contractor @domgreen mentioned. I'm here to work specifically on this issue, along with all the other helping hands :)

I am on CNCF Slack for any insta-chats and discussions: https://slack.cncf.io/
Channel: #thanos-dev

@zhulongcheng, thanks for the detailed design proposal, looks very comprehensive! I will look at it in detail ASAP. Let's connect on CNCF Slack?

In the meantime, the broad plan is:

  • get a public repro of the issue
  • get this fixed in a way acceptable to the project and users
  • use the repro to confirm the fix

@asuffield

For the sake of a repro, I can fill in some details, as we've now root-caused the original problem that spawned this effort.

This is all about label value cardinality, and in particular a dimension which prometheus itself is less sensitive to: the cardinality of all stored values over time (prometheus only cares about the cardinality of values covered by a query, not the amount stored). Due to compaction cycles, the index.cache.json files in compacted blocks end up accumulating multiple lists of every value ever.

Our specific culprit turned out to be kube-state-metrics, and the pod metrics in particular. On any reasonably sized cluster this creates many new label values for every container created, and if these get stored into thanos then the amount of memory required for it to start inflates rapidly. A few months of history on a decent sized cluster worked out to 40Gb of label values.

@ppanyukov
Contributor

@asuffield, this is great - do you want to connect on CNCF Slack (https://slack.cncf.io/) to discuss?

@miekg
Contributor Author

miekg commented Oct 2, 2019 via email

@ppanyukov
Contributor

@bwplotka and I had a good discussion on this today and came up with a test/repro plan which we think should help us move forward on this issue.

/cc: @domgreen


The repro/test plan:

  • have a thanos-bench repo dedicated to benchmark testing

  • use k8s to run and orchestrate components:

    • (Minio) Data: minio+init
      • generate using CLI
      • add data to Minio
      • add some Thanos metadata to meta.json
    • Store Gateway: talks to Minio
    • Querier: talks to Store Gateway
    • Prometheus: will measure metrics like resource usage, latency, etc.

We will use this to repro the OOM and other issues and to measure the impact of any proposed fixes.

@miekg
Contributor Author

miekg commented Nov 20, 2019

I'm not following anything discussed on Slack, but I'm of course curious about any updates and/or progress made here. Is there anything to share? Thanks!

@farvour

farvour commented Feb 28, 2020

For the sake of a repro, I can fill in some details, as we've now root-caused the original problem that spawned this effort.

This is all about label value cardinality, and in particular a dimension which prometheus itself is less sensitive to: the cardinality of all stored values over time (prometheus only cares about the cardinality of values covered by a query, not the amount stored). Due to compaction cycles, the index.cache.json files in compacted blocks end up accumulating multiple lists of every value ever.

Our specific culprit turned out to be kube-state-metrics, and the pod metrics in particular. On any reasonably sized cluster this creates many new label values for every container created, and if these get stored into thanos then the amount of memory required for it to start inflates rapidly. A few months of history on a decent sized cluster worked out to 40Gb of label values.

Yup, same here. I've been watching these memory issues for a while, and it seems like nobody really had a good explanation until I read this. Lo and behold, it seems like Improbable is finally experiencing the issues first-hand and has finally decided that this mmap model is pretty bad (about time).

@bwplotka
Member

bwplotka commented Feb 28, 2020

What mmap model exactly do you mean, @farvour?

Also, why do you say Improbable? At this point, Improbable does not even have any maintainers on the Thanos team.

Anyway, this piece was totally reworked and is available in the 0.11.1 RC: #1952
