Store: Add time & duration based partitioning #1077
Conversation
Looks good so far! Worth having some tests, especially for tdv.
Good work!
Happy that we're getting things in this area rolling. In the future I could also see sharding based on external labels in the block's metadata, for example, as even if we're limiting the time with this, there are other dimensions that are essentially unbound. Block selection + time selection is probably what we ultimately want, but this is a great start.
cmd/thanos/store.go
Outdated
```go
minTime := store.TimeOrDuration(cmd.Flag("min-time", "Start of time range limit to serve. Option can be a constant time in RFC3339 format or time duration relative to current time, such as -1.5d or 2h45m. Valid duration units are ms, s, m, h, d, w, y.").
	Default("0000-01-01T00:00:00Z"))

maxTime := store.TimeOrDuration(cmd.Flag("max-time", "End of time range limit to serve. Option can be a constant time in RFC3339 format or time duration relative to current time, such as -1.5d or 2h45m. Valid duration units are ms, s, m, h, d, w, y.").
```
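For context, here is a minimal, hypothetical sketch of how a value like this could accept either an RFC3339 constant or a Prometheus-style relative duration; the type and field names are illustrative, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/common/model"
)

// timeOrDuration is an illustrative stand-in for the flag value above: it keeps
// either an absolute time or a duration relative to "now".
type timeOrDuration struct {
	t   *time.Time
	dur *model.Duration
}

// Set first tries an RFC3339 constant and then falls back to a Prometheus-style
// duration (units like s, m, h, d, w, y), handling a leading '-' for "in the past".
func (v *timeOrDuration) Set(s string) error {
	if t, err := time.Parse(time.RFC3339, s); err == nil {
		v.t = &t
		return nil
	}
	neg := false
	if len(s) > 0 && s[0] == '-' {
		neg = true
		s = s[1:]
	}
	d, err := model.ParseDuration(s)
	if err != nil {
		return fmt.Errorf("%q is neither an RFC3339 time nor a duration: %v", s, err)
	}
	if neg {
		d = -d
	}
	v.dur = &d
	return nil
}

func main() {
	var min, max timeOrDuration
	fmt.Println(min.Set("0000-01-01T00:00:00Z")) // absolute lower bound, like the default above
	fmt.Println(max.Set("-2w"))                  // two weeks before "now", resolved later
}
```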
This should be more explicit about whether the exact timestamp is included or not. Maybe even an example in the docs of what a sane setup would be.
👍
Agree, some short docs would be nice.
LGTM generally, some suggestions though.
```go
	id:  id,
	dir: dir,
}
if err := b.loadMeta(ctx, id); err != nil {
```
Can we load meta once? I think we do it everywhere in small functions ):
maybe we should cache those on disk, not sure
`b.loadMeta` already does this:

```go
if _, err := os.Stat(b.dir); os.IsNotExist(err) {
	if err := os.MkdirAll(b.dir, 0777); err != nil {
		return errors.Wrap(err, "create dir")
	}
	src := path.Join(id.String(), block.MetaFilename)
	level.Debug(b.logger).Log("msg", "download file", "bucket", b.bucket, "src", src, "dir", b.dir)
	if err := objstore.DownloadFile(ctx, b.logger, b.bucket, src, b.dir); err != nil {
		return errors.Wrap(err, "download meta.json")
	}
} else if err != nil {
	return err
}
meta, err := metadata.Read(b.dir)
if err != nil {
	return errors.Wrap(err, "read meta.json")
}
```
pkg/store/flag_test.go
Outdated
```go
)

func TestTimeOrDurationValue(t *testing.T) {
	cmd := kingpin.New("test", "test")
```
You don't need kingpin for this test IMO
@bwplotka but I really like it :P
I think it's an e2e test for flag parsing, and since the package imports kingpin and has a special function for it, IMO it makes sense.
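If one wanted to drop kingpin here, a test could drive the value directly through its `Set` method instead. This is only a sketch: it assumes the flag helper exposes `TimeOrDurationValue` with `Set(string) error` and `PrometheusTimestamp() int64`, which may not match the PR exactly.

```go
package store

import (
	"testing"
	"time"
)

// Sketch only: the type and method names below are assumptions about this PR's
// flag helper, not verified against it.
func TestTimeOrDurationValue_NoKingpin(t *testing.T) {
	// An absolute RFC3339 time should resolve to its exact millisecond timestamp.
	var abs TimeOrDurationValue
	if err := abs.Set("2018-01-01T00:00:00Z"); err != nil {
		t.Fatal(err)
	}
	want := time.Date(2018, 1, 1, 0, 0, 0, 0, time.UTC).UnixNano() / int64(time.Millisecond)
	if got := abs.PrometheusTimestamp(); got != want {
		t.Fatalf("got %d, want %d", got, want)
	}

	// A relative duration should resolve to some moment before "now".
	var rel TimeOrDurationValue
	if err := rel.Set("-2h"); err != nil {
		t.Fatal(err)
	}
	nowMs := time.Now().UnixNano() / int64(time.Millisecond)
	if rel.PrometheusTimestamp() >= nowMs {
		t.Fatalf("expected a relative value to resolve before now")
	}
}
```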
pkg/store/bucket.go
Outdated
```go
}

// We check for blocks in configured minTime, maxTime range
if b.meta.MinTime < s.minTime.PrometheusTimestamp() || b.meta.MaxTime > s.maxTime.PrometheusTimestamp() {
```
I think you want to compare against a block's mintime, or maxtime, but not both. The scenario I have in mind is something like this:

- You have a block that covers 01:00:00 to 03:00:00.
- You have a `thanos-store` process covering 00:00:00 to 02:00:00.
- You have a second `thanos-store` process covering 02:00:01 to 04:00:00.

I think with the logic above, neither thanos-store process will load the block, because its mintime is inside the first thanos-store process' range but its maxtime is outside of it. For the second thanos-store you have the opposite situation. Since neither of them owns the entire block, neither of them will serve it.
Yea I think essentially we need this:
```go
// Current condition:
if b.meta.MinTime < s.minTime.PrometheusTimestamp() || b.meta.MaxTime > s.maxTime.PrometheusTimestamp() {
// Suggested condition:
if b.meta.MaxTime < s.minTime.PrometheusTimestamp() || b.meta.MinTime > s.maxTime.PrometheusTimestamp() {
```

So:

```
Store range:      |min --------------- max|
Blocks:   |------ A --------|    |-------- B --------|
```
Thanks for spotting this @claytono
Also can we have tests that would catch this?
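To make the suggested condition concrete, here is a hedged sketch of an interval-overlap check plus a test covering the split-ownership scenario above; the helper and test names are mine, not the PR's:

```go
package store

import "testing"

// blockOverlapsRange reports whether a block [blockMin, blockMax] overlaps the
// store's configured [storeMin, storeMax] window at all. It mirrors the
// suggested condition: drop the block only if it ends before the window starts
// or begins after it ends. (Inclusive/exclusive edges are an assumption here.)
func blockOverlapsRange(blockMin, blockMax, storeMin, storeMax int64) bool {
	return !(blockMax < storeMin || blockMin > storeMax)
}

func TestBlockOverlapsRange(t *testing.T) {
	// Block covering 01:00-03:00 (arbitrary millisecond units for the test).
	blockMin, blockMax := int64(1*3600*1000), int64(3*3600*1000)

	// Store A covers 00:00-02:00, store B covers 02:00-04:00.
	if !blockOverlapsRange(blockMin, blockMax, 0, 2*3600*1000) {
		t.Error("store A should keep the block: it overlaps the window's end")
	}
	if !blockOverlapsRange(blockMin, blockMax, 2*3600*1000, 4*3600*1000) {
		t.Error("store B should keep the block: it overlaps the window's start")
	}
	// A store covering 04:00-06:00 should not load the block at all.
	if blockOverlapsRange(blockMin, blockMax, 4*3600*1000, 6*3600*1000) {
		t.Error("a disjoint window should drop the block")
	}
}
```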
@bwplotka I think in that scenario, we will end up with the same block served by two stores. For example:
```
Store A range: |min --- max|
Store B range: |min --- max|
Store C range: |min --- max|
Blocks:        |----- A -----|----- B -----|----- C -----|
```
In this scenario, I think we want blocks A, B and C to be served by a single store. With my previous PR, I picked minTime as the determining factor, so with that PR the assignments would be like this:
- Block A -> Store A
- Block B -> Store A
- Block C -> Store B
I'm not sure there is a "right" answer for whether we should use minTime or maxTime for assigning blocks to stores, but in the interest of efficiency, if store time ranges are contiguous, then each block should only be owned by a single store.
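A tiny sketch of that single-owner policy (the name is mine, not the PR's): with half-open, contiguous, non-overlapping store windows, keying ownership off the block's minTime alone assigns every block to exactly one store.

```go
package store

// ownsBlock is an illustrative alternative policy, not the PR's code: a store
// owns a block iff the block's start time falls inside the store's half-open
// window [storeMin, storeMax). With contiguous, non-overlapping windows each
// block gets exactly one owner, at the cost of that store serving some samples
// that extend past its window.
func ownsBlock(blockMinTime, storeMin, storeMax int64) bool {
	return blockMinTime >= storeMin && blockMinTime < storeMax
}
```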
IMO it's fine in the current version, i.e. users just need to be careful when setting the right parameters.
So you set |minTime - maxTime| and the store will only respond with blocks that fit that range. Making Thanos Store return blocks outside the range is IMO broken behavior & it will just confuse people, as Thanos Store would be returning blocks outside |minTime - maxTime|.
I'll add docs for that.
BTW there are tests for current behaviour. If we decide to change them we can refactor those.
I think it depends on what your use case is. If the implementation requires that blocks must be entirely within the time range for the store, it may be impossible to pick time ranges where all blocks will be handled by a store, without having overlapping time ranges, which has other efficiency issues.
Our primary desire for time based ranges is to be able to partition by time ranges, so the behavior as implemented now doesn't work well for us. As the compactor generates larger and larger blocks, overlap between blocks with different labels makes it more and more likely that you're going to be serving very large blocks from multiple stores. This means you'll be loading large blocks into memory in multiple places and wasting memory. For us, memory usage of the store processes is the primary cost for Thanos.
Perhaps this should be configurable if different people want different behavior.
docs/components/store.md
Outdated
```
By default Thanos Store Gateway looks at all the data in Object Store and returns it based on query's time range.

You can shard Thanos Store gateway based on time or duration relative to current time.

For example setting: `--min-time=-6w` & `--max-time=-2w` will make Thanos Store Gateway look at blocks that fall within `now - 6 weeks` up to `now - 2 weeks`.
```
Is "now" the time the process starts, or the query-time now? As far as I can tell it's only evaluated based on `.InitialSync`. I feel like people are going to shoot themselves in the foot a lot with the duration based approach.
`now` is the time of the sync; we run syncs every 3m. I'll update the docs.
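To make the timing explicit (hypothetical helper, not the PR's code): a relative bound like `-6w` is re-resolved against the wall clock at each sync, so the effective cut-off slides forward roughly every 3 minutes rather than being fixed at process start.

```go
package main

import (
	"fmt"
	"time"
)

// resolveRelative turns a relative duration into a concrete millisecond
// timestamp at the moment of a sync. Because the sync loop calls this again
// every ~3m, the window moves with wall-clock time.
func resolveRelative(rel time.Duration, syncTime time.Time) int64 {
	return syncTime.Add(rel).UnixNano() / int64(time.Millisecond)
}

func main() {
	now := time.Now()
	minTime := resolveRelative(-6*7*24*time.Hour, now) // roughly --min-time=-6w
	maxTime := resolveRelative(-2*7*24*time.Hour, now) // roughly --max-time=-2w
	fmt.Println(minTime, maxTime)
}
```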
I've updated the docs and changed the implementation.
Maybe we should actually do filtering based on just the block's start time; for Thanos Store1, blocks would just constantly flow from Store1 to Store2, without gaps. This would also work nicely with constant time partitioning.
@povilasv Sorry, I didn't see this newer reply before replying to the older one. Filtering on just block start time would work well for us.
I think looking at the block's start time makes most sense for time based partitioning. I've changed the implementation and updated docs & tests.
Thanks for the great work, guys, and sorry for the delay.
TL;DR: I am not sure I agree with the decision to use the block's startTime. More below in the comments. (: I believe this is a bit too confusing for the end user and assumes too much. I think we should focus on shipping this in a basic, non-confusing form and on actually mitigating store mem consumption as per #448.
Let me know if that makes sense. It's 1:30 AM my time 🤕
cmd/thanos/store.go
Outdated
```
@@ -51,11 +52,23 @@ func registerStore(m map[string]setupFunc, app *kingpin.Application, name string
	blockSyncConcurrency := cmd.Flag("block-sync-concurrency", "Number of goroutines to use when syncing blocks from object storage.").
		Default("20").Int()

	minTime := model.TimeOrDuration(cmd.Flag("min-time", "Start of time range limit to serve. Thanos Store serves only blocks, which have start time greater than this value. Option can be a constant time in RFC3339 format or time duration relative to current time, such as -1.5d or 2h45m. Valid duration units are ms, s, m, h, d, w, y.").
```
First of all, we should name it properly if we choose block min time and block max time: `block-min-time` and the same for max, maybe? At some point we could have `min-time` for proper time partitioning (all blocks that overlap are synced and only the exact time range is announced and returned). Also "Start of time range limit to serve." is not true then: with e.g. a 12:30 minTime, when a block has startTime 12:29, we don't serve metrics from it even though the block finishes after 12:30 (right?).
However, just to revisit the decision for block-time based partitioning instead of just time, plus the fact that we take the start time.
I have mixed feelings @claytono @povilasv

Against:
- If both max and min are inclusive, or both exclusive, your use case will not work in all cases. How will you compose a chain of store gateways then?
- Blocks are just some format and they can change a lot, for example during compaction. Also we don't assume any particular sync of those timestamps among different sources (Prometheus). So someone can upload a block (let's work in hours for simplicity) covering `11:30` to `13:30` and someone else can upload one covering `11:31` to `13:31`. Now if you want to spread blocks evenly among store gateways this gets quite tricky. Set `block-min-time` to `11:30`? That will include both blocks. However, what should the next store gateway be then, `block-max-time=11:30`?
- How will you partition time when the compactor is compacting blocks a lot? When those block time ranges are super dynamic?
- I find it extremely unintuitive to work on startTime only. So data newer than `block-min-time` is NOT served if it is in a block with min time older than `block-min-time`. And I set `block-max-time` and still get returned samples AFTER `block-max-time` (even though we announce something via Info). That's essentially super confusing to use IMO. Maybe naming and documenting properly will help, but it's not great.
- I think @claytono can achieve block partitioning by pure time based partitioning as well -> just make sure the time range is not overlapping with unwanted blocks AFAIK
Pros:
- We already work on blocks on the store gateway?
- We avoid looking up potentially big blocks that overlap the given time partition by as little as a single sample for some reason (which should not be that much of a mem overhead IMO)
I think I partially disagree with the last comment in #1077 (comment) (cannot reply) from @claytono.
We never load a whole block into memory in the store gateway. Sure, there is a certain baseline memory added for the essential "index-cache", plus on each query, if the time range overlaps with a block, we fetch matching postings and series and check overlaps against the chunk itself. If we loaded the whole block into memory we would not need such complex indexing AND every query would need TBs of memory (: This being said, IMO it's fine to sync one or more extra blocks just because there is some overlap.
Awesome, thanks @povilasv and @claytono for the very productive work here, but let's discuss it a bit more @claytono, as I think you are trying to work around #448, which might have many other, better and simpler solutions. Even though a workaround for the time being is better than being blocked, that extra duplication in blocks on the edges does not matter IMO. I would rather do proper time range partitioning as we initially talked about, and spend time and effort on #448 (as it is important!), as there might be huge wins by just optimizing, streaming more, limiting cardinality etc. Just so you know, Store memory consumption has been a known issue since the very 0.1.0 and it's hitting (and costing) us all a lot, so we are actively seeking improvements, but more hands on this would be preferable.
BTW, have you tried sharding functionally instead of time based? Like storing things in different buckets? What if we added matchers for the external labels that a store would be exposing only (hypothetically)?
> First of all we should name it properly if we choose block min time and block max time. block-min-time and same for max maybe?

I can name it this way and add `block-max-time`, not a problem.

> (all blocks that overlaps are synced and only exact time range is announced and returned).
> If both max and min are inclusive or both exclusive you use case will not work in all cases. How you will compose chain of store gateways then?

Overlaps are not a problem for Thanos: Thanos Querier will merge responses and dedup, so you just make an overlapping chain of `--block-min-time` & `--block-max-time`.

> Now if you want to spread blocks evenly among store gateway this gets quite tricky.

This PR is not about spreading the blocks evenly. I don't really care about that and I don't really see the use case for spreading blocks evenly.

> How you will partition time when compactor is compacting blocks a lot? When those block time ranges are super dynamic?

Not an issue: the sync job already adds new blocks which fit the time range and removes the old ones. So sometimes a block will go into one Thanos Store, sometimes into another.

> I find it extremely not intuitive to work on startTime only.

We can add 4 flags then: `--block-start-time-min-time`, `--block-start-time-max-time`, `--block-end-time-min-time`, `--block-end-time-max-time`.

> I think @claytono can achieve block partitioning by pure time based partitioning as well -> just make sure time range is not overlapping with not wanted blocks AFAIK

I don't follow: why do we actually care that the time range is not overlapping?

The idea is that we need to scale Thanos Store; it's impossible to keep low response times if we don't partition Thanos Store gateway. I.e. what happens to our LRU cache if someone queries a short time period (<1w) and then a few people query long ones (>3mo), etc.? When you shard you can optimize for that type of workload.

Also, in general it is more approachable: instead of having one big store gateway we have many little ones, which has the advantages of easier scheduling (no need for a separate node for the store gw anymore), reduced query latency (every store performs a smaller amount of work), etc., and many more clients against the object store, which, as experience shows, is faster than having a single client against the bucket.

IMO block juggling is fine. Maybe I need to improve the docs more about the overlapping case, so that people don't try to do perfect ranges.
I agree with @povilasv. For us this is about horizontal scaling. We're already partitioning by time by having multiple buckets and creating new ones as needed. This PR will allow us to reduce a lot of manual work involved in that and automate the rest.
While there may be better solutions to #448, I think we're going to always have a need for horizontal scaling and we don't have a solution to #448 today. I definitely want to see the memory efficiency improved, but horizontally scaling has allowed us to ignore the problem for the time being. This PR gives us breathing room while that's worked out.
> and many more clients against the Object store, which as experience shows is faster than having a single client against bucket.

Where do you get that from? (:

Ok @claytono @povilasv, so maybe from a different position. Let's say we have just `min-time` and `max-time` ranges? Nothing block based, but purely time wise. Why is that insufficient?

My main concern is that tying some user experience API (store gw flag) to some hidden format that is eventually synced will cause only pain. I would like to avoid this. Having simple time range flags would be both simple and understandable, and would solve your use case, wouldn't it?
> Where do you get that from? (:

For example https://care.qumulo.com/hc/en-us/articles/360015703713-Performance-Results-of-MinIO-with-Qumulo

Take a look at any object store benchmarks; I haven't seen a benchmark showing that a single client gets faster reads than multiple clients. But you can prove me wrong.

In the MinIO case, a single client was 643 MB/s; multi-stream read was 1.1-1.7 GB/s.
S3: single client 390.8 MB/s; multi-stream read 1-1.5 GB/s.

> Nothing block based, but purely time wise. Why is that insufficient?

Again, time/duration based is just a simple way to shard Thanos Store for large deployments.

> My main concern is that tying some user experience API (store gw flag) to some hidden format that is eventually synced will cause only pain.

Yes, that's the only drawback of this solution: people will need to be educated on doing overlapping time ranges, or not doing it at all.
S3 has eventual consistency anyway, so in any solution we would have to take similar measures.
I did some stupid simple "benchmarking": opened the Thanos Store Grafana dashboard and did 3 cases (2d, 7d, 30d queries).
Current standard config: Thanos Sidecar + 2 Thanos Stores, no time partitioning (2d, 7d, 30d).
New config: 2 Thanos Stores with short term data, 2 Thanos Stores with long term data, only Sidecars serve < 40h (30d).
Note that the first screenshots are missing response times for some queries, sorry about that.
Lastly, IMO time-sharding is a must. Imagine you have 10 or 100 years of data; we would need mainframes to handle 100-year queries. Sharding does solve this. @bwplotka if you are worried about this, maybe we should mark it as experimental and use hidden flags?
LGTM, thanks for working on this 👍 I can definitely see what problem this solves. I'm not sure, but maybe it would be useful to add some logging at the debug level around `SyncBlocks()` when a block gets dropped due to these restrictions, to reduce potential confusion. WDYT?
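For illustration, a hedged sketch of what such a debug line could look like in the sync path, following the go-kit logging style already used in this file; the helper name and log fields are guesses, not the PR's code:

```go
package store

import (
	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// logDroppedBlock is a sketch: call it whenever the sync loop skips a block
// because it falls outside the configured min/max time window.
func logDroppedBlock(logger log.Logger, blockID string, blockMin, blockMax, storeMin, storeMax int64) {
	level.Debug(logger).Log(
		"msg", "skipping block outside configured time range",
		"block", blockID,
		"blockMinTime", blockMin,
		"blockMaxTime", blockMax,
		"storeMinTime", storeMin,
		"storeMaxTime", storeMax,
	)
}
```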
Co-Authored-By: povilasv <[email protected]>
I'm continuing work on a fresh PR #1408.
Changes
V2 of https://github.com/improbable-eng/thanos/pull/957/files
Doesn't involve any deletions & adds support for durations.
#814
Verification
- Multiple tests added.
- Validated mintime > maxtime behaviour.
- Validated which blocks get loaded.
- Checked the metrics.
- Actually ran this in dev.