pick a heuristic to avoid the effects of large level size compensation #2832
Heuristic 1: CapCompensated. TODO: might want to try prioritizing the level with the higher raw score if multiple levels end up with the same compensated score. Explanation for the briefly high L0 scores in the Grafana run: once we delete the table in clearrange, we end up in the kind of scenario where there's barely any data in the level, so the level's rawScore is almost 0. This amplifies the score of the previous level. Look at L2: it has a rawScore of 0.0 but a compensated size of 546MB, and the low rawScore amplifies the L0 score a ton.
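A minimal sketch of the capping idea, using hypothetical names (capCompensatedScore, maxCompensation) rather than the actual Pebble identifiers:

```go
package sketch

// capCompensatedScore bounds how much level-size compensation (credit for
// data that DELs/RANGEDELs would drop) is allowed to inflate a level's score,
// so that compensation cannot distort prioritization without limit.
// Illustrative only; not the real Pebble scoring code.
func capCompensatedScore(rawScore, compensation, maxCompensation float64) float64 {
	if compensation > maxCompensation {
		compensation = maxCompensation
	}
	return rawScore + compensation
}
```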
Heuristic 4: CompensatedForPreferringHigherLevels. Consistently works a little bit better than
Master Grafana run. The SQL statements/sec during the dip is ~25k, and it's consistently less than the
One possible variant of this: compute the raw score x, and compute C = total compensation / level target size. Today, the compensated score is x + C. Instead, compute the compensated score as x + ln(e+C) - 1. This would make the compensated scores grow logarithmically with the amount of data the levels would drop, and keeps the score continuous, so that we sidestep any concerns about equal scores introduced by a hard cap.
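A quick sketch of that formula; the function name is hypothetical and c is the compensation term defined above:

```go
package sketch

import "math"

// logCompensatedScore implements the logarithmic variant proposed above:
// x is the raw score, c is the compensation term (total compensation divided
// by the level's target size). The result equals x when c == 0 and grows only
// logarithmically with c, with no hard cap and therefore no ties introduced
// by capping.
func logCompensatedScore(x, c float64) float64 {
	return x + math.Log(math.E+c) - 1
}
```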
Can you share the logs?
Internal link for logs: https://drive.google.com/file/d/1kvEuCyz7DlZij9QrH-CFdqHvbI19S6E4/view?usp=sharing The runs were for heuristics 1, 3, 4, and master. The cluster name is present in the logs and can be used to view the Grafana dashboard for the run.
I see directories of the form 1.cap_compensation_1, 1.compensate_higher, 1.master_1, 1.master_2, 1.raw_scores. I assume the 1 in the 1. prefix is the node name. What are the heuristics used in each of these?
cap_compensation is the same as #2832 (comment). raw_scores is #2832 (comment). compensate_higher is #2832 (comment). master_2 is the latest cockroach master using this version of Pebble. The first line in the Pebble logs contains the cluster names.
cap_compensation_1 dip in SQL throughput: https://grafana.testeng.crdb.io/d/pZsi7NjVk/admission-control-copy?orgId=1&var-cluster=arjunnair-1692828628-01-n10cpu16&var-instances=All&from=1692832550747&to=1692832975547 Looking at node 2's io_load_listener logs, which only log when there is overload, there are only 4 log entries. Note the low "admitting" rate near the end of each log statement, which is responsible for the queueing.
The Pebble logs from the compaction_picker show a lower sub-level count, because it compensates for the ongoing compactions. For example, the sub-level count seen by the compaction_picker, 7, is lower than the 17 seen by io_load_listener around the same time. We could consider changing what the io_load_listener sees, but there are bigger problems here (see below).
For this heuristic: #2832 (comment), there's a chance I'm not setting the raw scores properly for L0. I'm going to make a change and re-run.
We also have a configuration problem. The following is a summary of compactions over 30s intervals:
This is because of the L0CompactionConcurrency logic (lines 1232 to 1238 in 1a45921): if the compaction_picker sees a sub-level count (accounting for ongoing compactions) of 7, it thinks a compaction concurrency of 1 is sufficient, since the default value of this setting is 10. Since we now start shaping at a sub-level count of 5 for regular traffic and lower for elastic traffic, I propose we lower this setting to 2.
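A simplified illustration of that gating, assuming the concurrency limit grows by one per l0CompactionConcurrency sub-levels; this is an approximation of the behavior described above, not the exact Pebble formula:

```go
package sketch

// allowedCompactionConcurrency approximates the gating referenced above
// (lines 1232 to 1238 in 1a45921): the permitted compaction concurrency grows
// by one for every l0CompactionConcurrency sub-levels in L0. With the default
// of 10, a sub-level count of 7 permits only a single compaction; with the
// proposed value of 2 it would permit several. Assumes l0CompactionConcurrency > 0.
func allowedCompactionConcurrency(l0SubLevels, l0CompactionConcurrency int) int {
	return 1 + l0SubLevels/l0CompactionConcurrency
}
```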
Yes, I figured something was wrong. For completeness: raw_scores node 10 is the worst. From 21:16:00 to 21:20:00 nothing is getting compacted out of L0, despite a huge score:
And there are no huge compactions out of L0 blocking additional compactions either. There is just a gap of 4min 40s:
compensate_higher: Looking at 19:57:30 to 20:00:00 on node 8 (one of many nodes that show IO token exhaustion).
L0 is consistently being picked, so the heuristic is "working"
The problem is that the other levels are being starved: see above that L2 now has 130MB instead of its goal of 64MB, which means more expensive compactions out of L0. For example, in the following, 29MB of L2 is rewritten to compact 593KB out of L0!
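As a rough measure of that inefficiency: 29MB read from L2 to clear 593KB from L0 works out to roughly 50 bytes rewritten for every byte compacted out of L0.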
The reason the other levels are being starved is that the compaction concurrency is not increasing:
I looked again at this compensate_higher run and node 8. Here is the state of the LSM at the start of the increase in sub-levels followed by the state of the LSM near the end of the "overload":
Note that the bytes in the LSM have not changed much, which is why only L0 and L2 have rawScore > 1.0, even though the only compactions happening are L0=>L2. So if we ignore this scenario with compensated bytes (due to DELs and RANGEDELs), and consider something like this happening with normal SETs, we can have a situation where we start doing inefficient compactions out of L0, because the compaction concurrency is still 1 and therefore only L0 can get picked. So #109780 would be helpful in general, and not just in these compensation scenarios.

But then the question is why so few bytes are being added to the LSM despite so many score-driven compactions from L0=>Lbase. The answer turns out to be that we are constantly flushing small memtables during this time interval, which is not the normal workload. And it is happening because of ingests that overlap with the memtable, e.g.:
Based on looking at the io_load_listener logs (from the previous comment), I suspect this experiment is not running with cockroachdb/cockroach#109332. Is that correct? I have a suspicion that with that PR we would not see a drop in SQL throughput -- we would still do inefficient compactions out of L0, but AC would not throttle. Can we run with both cockroachdb/cockroach#109332 and cockroachdb/cockroach#109780 on master and see how much the throughput drops -- I wouldn't be surprised if the drop is much smaller.
109780: storage: set L0CompactionConcurrency to 2 r=bananabrick a=sumeerbhola The default in Pebble, 10, delays increasing the compaction concurrency which leads to the bad behavior discussed in cockroachdb/pebble#2832 (comment). The value of 10 was chosen when admission control was not shaping incoming traffic until there were 20 sub-levels. Admission control now shapes regular traffic starting at 5 sub-levels and elastic traffic starting at 1 sub-level. Informs #104862 Informs cockroachdb/pebble#2832 Epic: none Release note: None Co-authored-by: sumeerbhola <[email protected]>
CompensatedOnlyAsThreshold (raw scores): We see the least impact on foreground throughput out of all the runs, so this heuristic is working in terms of getting rid of some of the effects of level compensation. I'm looking at the logs to figure out why we still see a slight dip in foreground throughput, to see if we can eliminate that. Taking a look at node 10, which has the highest IO token exhaustion during the dip. The compaction logs for a random 15s interval prior to the dip look like:
One thing I don’t understand is why the “bytes” column for the flush is 320MB, but the “in(B)” column of the L0->L2 compaction is only 175MB. The “bytes” column of the flush should be the total bytes of the sstables written to disk during the flush. The “in(B)” column for the L0->L2 compaction should be the total bytes read from L0 and L2 during the compaction. This means that the L0 size should be growing over time. That isn’t the case, so I must be misinterpreting those numbers. Going to ignore the discrepancy for now. The IO token exhaustion occurs from ~4:34 to ~4:38. Here are the admission control logs during the throughput drop:
At 4:35:24 we’re admitting 4.8GB every 15 seconds, but at 4:35:39, we end up admitting only 41MB. The difference seems to be a growth in sublevels. While it’s difficult to correlate the timestamps in the AC logs with the timestamps in the pebble compaction logs, around this time, we see some ingestions into L0.
We see 22 tiny ingestions into L0 over a 15 second interval (4:35:45-4:36:00). I suspect this is contributing to the IO token exhaustion, as you indicated in your previous comments. Also, it seems like we’re not using a compaction concurrency of 3, which you also indicated.
The compaction logs also indicate that we’re flushing 100s of MBs every 15 seconds. But the IO load listener logs indicate that the used tokens/admitted tokens every 15 seconds are much less than 100MB. I don’t understand this discrepancy either. Is this just incorrect write token estimates? @sumeerbhola There are still some compactions like these, which have tiny L0 files overlapping with the largest Lbase files:
Note that the compaction occurs at 4:13, which is well before any level compensation kicks in, so I believe the current master will behave identically. Here's an internal Google Drive link with the node 10 logs:
Regarding the other comments:
I no longer think we should use a cap for compensation. The constant we use is arbitrary, and I've seen customer issues where non-L0 levels have higher scores than L0 due to compensation, but the scores aren't higher than 3.
Since this was a short 15s window, is it possible L0 size was growing just over the course of the 15s window but not at a longer time scale? If compactions take multiple seconds, it's not unreasonable to get a high variation like this with such a small time window.
I suspect we are not running with cockroachdb/cockroach#109780 since for a substantial part of the overload we are running with 1 compaction at a time out of L0 even though the compaction_picker is seeing things like (in this case
I don't see a discrepancy between the flush stats in:
and
The latter is slightly bigger than the flush because there are ingests into L0 (that the compactions tool is not picking up due to a bug).
As you noted, we have very inefficient compactions. This is very pronounced during the overload, e.g.:
I don't think we should do anything more about this inside Pebble -- the problem is tiny ingests, and we should fix cockroachdb/cockroach#109808.
The discrepancy doesn't occur during the dip in throughput and IO token exhaustion, but before it, and it lasts for a while. See the logs over 10 minute intervals from 4:10 to 4:30.
running
Looks like there is still a dip using the raw scores heuristic on top of the latest Pebble master, running with the latest cockroach master. But the dip is tiny and doesn't last long. Most of the IO token exhaustion happens on node 5 from 21:52 to 21:53:15: the L0 sub-level count increases and then goes back to 0 from 21:50:30 to 21:52:45. Here are the 15 second logs from that period. It seems like we are picking compactions almost exclusively out of L0, and we also have high compaction concurrency. The problem is that the compactions picked are super expensive. This seems to correlate with the increased ingestions into L0. I think it's worthwhile to fix this, but I don't think it should block this issue. I don't think the dip is happening due to this heuristic. Although, it's possible that normally (on master) we would've picked compactions out of the non-L0 levels because of compensation, which would've then prevented the expensive L0 -> Lbase compactions.
It was probably due to a different version of the pebble log tool being used over the same logs. I tried running the latest pebble build + #2906 on the logs from the latest run, and I don't see this discrepancy.
During large table drops, we see level compensation become drastically high, which leads to L0 compaction starvation, which in turn leads to AC queueing of requests and increases in foreground latencies. We now completely ignore compensation when prioritizing levels for compaction, but we still consider compensation when deciding whether a level should be picked for compaction at all. This ensures that L0 isn't starved during large table drops.

Note that there might be scenarios where a level, say L3, has a high compensated score due to a wide range delete and would also drop a ton of data in L4, but we end up picking L4 for compaction because it has a higher raw score. While this will make some of the compactions picked inefficient, most of the data dropped due to the range delete should be in L6 anyway, and the L5 -> L6 compaction that contains this range delete will still be cheap.

We considered some other heuristics in #2832. Other approaches considered include capping the level compensation at a constant, or accounting for compensation in higher levels but not lower levels. The first approach might not work well, because the level compensation can be lower than the constant picked and still starve out L0. The second approach is more difficult to reason about compared to the current approach.
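A sketch of the scheme this describes (the CompensatedOnlyAsThreshold idea from earlier in the thread): the compensated score is used only to decide whether a level is eligible for compaction, while the raw score decides which eligible level is prioritized. The types and names here are illustrative, not the actual Pebble picker code.

```go
package sketch

// levelScore holds the two scores discussed in this thread.
type levelScore struct {
	level            int
	rawScore         float64 // fill relative to the level's target, ignoring compensation
	compensatedScore float64 // includes credit for data the level's compactions would drop
}

// pickCompactionLevel uses the compensated score only as the eligibility
// threshold and orders eligible levels by raw score, so a large compensation
// cannot starve L0. Returns the chosen level and whether any level qualified.
func pickCompactionLevel(levels []levelScore) (int, bool) {
	bestLevel, found := 0, false
	var bestRaw float64
	for _, l := range levels {
		if l.compensatedScore < 1.0 {
			// Not eligible: even with compensation the level is below its target.
			continue
		}
		if !found || l.rawScore > bestRaw {
			bestLevel, bestRaw, found = l.level, l.rawScore, true
		}
	}
	return bestLevel, found
}
```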
Done in #2917.
110943: kvserver,storage: ingest small snapshot as writes r=itsbilal,erikgrinaker a=sumeerbhola Small snapshots cause LSM overload by resulting in many tiny memtable flushes, which result in a high sub-level count, which then needs to be compensated for by running many inefficient compactions from L0 to Lbase. Despite some compaction scoring changes, we have not been able to fully eliminate the impact of this on foreground traffic, as discussed in cockroachdb/pebble#2832 (comment). Fixes #109808 Epic: none Release note (ops change): The cluster setting kv.snapshot.ingest_as_write_threshold controls the size threshold below which snapshots are converted to regular writes. It defaults to 100KiB. Co-authored-by: sumeerbhola <[email protected]>
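A minimal sketch of the threshold decision that change introduces; the function name and signature are hypothetical and only illustrate the behavior of kv.snapshot.ingest_as_write_threshold:

```go
package sketch

// shouldIngestAsWrites illustrates the decision described above: snapshots
// whose total size is below kv.snapshot.ingest_as_write_threshold (default
// 100KiB) are applied as regular writes instead of being ingested as sstables,
// avoiding the tiny memtable flushes and sub-level buildup discussed earlier
// in the thread. Hypothetical helper, not the actual CockroachDB code.
func shouldIngestAsWrites(snapshotBytes, thresholdBytes int64) bool {
	return snapshotBytes < thresholdBytes
}
```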
Link to the compaction scoring internal doc.
The setup for the experiments was to run clearrange/checks=true.