
pick a heuristic to avoid the effects of large level size compensation #2832

Closed
2 tasks done
bananabrick opened this issue Aug 16, 2023 · 27 comments

Comments

@bananabrick
Contributor

bananabrick commented Aug 16, 2023

Link to the compaction scoring internal doc.

The setup for the experiments was to run clearrange/checks=true.

  • Post experimental results with PRs for the changes.
  • Post a blurb about the results for each run, comparing it to the other runs.
@bananabrick bananabrick self-assigned this Aug 16, 2023
@bananabrick bananabrick changed the title from "pick a heuristic avoid the effects of large level size compensation" to "pick a heuristic to avoid the effects of large level size compensation" Aug 16, 2023
@bananabrick
Contributor Author

bananabrick commented Aug 16, 2023

Heuristic 1: CapCompensated
Summary: Caps a level's compensated score at 3 (i.e., at 3x the target level size). If the rawScore is higher than 3, we use that instead. If every level has compensation, it's possible that every level's score gets capped at 3, but since we divide a level's score by the rawScore of the next level, each level should still end up with a slightly different score.
PR: #2827
Grafana run.

TODO: Might want to try and prioritize the level with the higher raw score if multiple levels end up with the same compensated score.
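
A minimal sketch of the capping rule described above, using hypothetical names (the actual change is in the linked PR):

```go
package main

import (
	"fmt"
	"math"
)

// capCompensatedScore sketches the CapCompensated idea: the compensated score
// is capped at 3, unless the raw score alone already exceeds 3, in which case
// the raw score is used.
func capCompensatedScore(rawScore, compensatedScore float64) float64 {
	const scoreCap = 3.0
	capped := math.Min(compensatedScore, scoreCap)
	return math.Max(capped, rawScore)
}

func main() {
	fmt.Println(capCompensatedScore(1.2, 42.8)) // heavy compensation -> clamped to 3
	fmt.Println(capCompensatedScore(7.0, 7.0))  // raw score already above 3 -> 7
}
```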


Explanation for the briefly high L0 scores in the Grafana run:

Once we delete the table in clearrange, we end up in scenarios where a level has barely any data left, so its rawScore is almost 0. This amplifies the score of the previous level. Look at L2: it has a rawScore of 0.0 but a compensated size of 546MB, and its near-zero rawScore amplifies the L0 score dramatically.

I230817 20:53:34.614874 10148619 3@pebble/compaction_picker.go:1301 ⋮ [n2,s2,pebble] 11328 +‹  *L0: 500.0    0.0    0.0     69MB     8.0EB›
I230817 20:53:34.614874 10148619 3@pebble/compaction_picker.go:1301 ⋮ [n2,s2,pebble] 11328 +‹   L2:  42.8    3.0    0.0    546MB      64MB›
I230817 20:53:34.614874 10148619 3@pebble/compaction_picker.go:1301 ⋮ [n2,s2,pebble] 11328 +‹   L3:   2.6    2.3    0.1    1.0GB     461MB›
I230817 20:53:34.614874 10148619 3@pebble/compaction_picker.go:1301 ⋮ [n2,s2,pebble] 11328 +‹   L4:   3.0    3.0    0.9    136GB     3.2GB›
I230817 20:53:34.614874 10148619 3@pebble/compaction_picker.go:1301 ⋮ [n2,s2,pebble] 11328 +‹   L5:   3.0    2.9    1.0     67GB      23GB›
I230817 20:53:34.614874 10148619 3@pebble/compaction_picker.go:1301 ⋮ [n2,s2,pebble] 11328 +‹   L6:   1.0    1.0    1.0    163GB     168GB›
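
For illustration, a rough sketch (hypothetical names) of the amplification described above, where a level's score is divided by the next level's raw score:

```go
package compactionsketch

import "math"

// amplify sketches the division described above: a level's score is divided by
// the raw score of the next level, so a nearly empty next level (rawScore
// close to 0) inflates the previous level's score. The floor here is purely
// illustrative, to keep the ratio finite.
func amplify(levelScore, nextLevelRawScore float64) float64 {
	const floor = 0.01
	return levelScore / math.Max(nextLevelRawScore, floor)
}
```

For example, dividing a score of 5.0 by a rawScore floored at 0.01 yields 500, the order of magnitude seen for L0 in the log above.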

The dip is at ~48k SQL statements/sec.
Screenshot 2023-08-17 at 6 08 46 PM

@bananabrick
Contributor Author

bananabrick commented Aug 16, 2023

Heuristic 2: NoCompensateL0
Summary: L0's score ignores compensation in any level beneath L0. See the PR for more details.
PR: #2731
Grafana run.


This works fine in terms of SQL statements/sec, but there's some concern that L0 will be prioritized too much.

@bananabrick
Contributor Author

bananabrick commented Aug 16, 2023

Heuristic 3: CompensatedOnlyAsThreshold
Summary: We rank the levels by rawSmoothedScores, but only pick a compaction out of a level if its compensated smoothed score is > 1.
Commit: 2dab8c1
Grafana run.
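
A rough sketch of the idea, with hypothetical names (the actual change is in the commit above):

```go
package compactionsketch

import "sort"

// levelInfo and pickLevel sketch CompensatedOnlyAsThreshold: levels are ranked
// by their raw smoothed score, but a level is only eligible for compaction if
// its compensated smoothed score exceeds 1.
type levelInfo struct {
	level                    int
	rawSmoothedScore         float64
	compensatedSmoothedScore float64
}

func pickLevel(levels []levelInfo) (picked int, ok bool) {
	sort.Slice(levels, func(i, j int) bool {
		return levels[i].rawSmoothedScore > levels[j].rawSmoothedScore
	})
	for _, l := range levels {
		if l.compensatedSmoothedScore > 1 {
			return l.level, true
		}
	}
	return 0, false
}
```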


Screenshot 2023-08-31 at 1 25 15 AM

@bananabrick
Contributor Author

bananabrick commented Aug 16, 2023

Heuristic 4: CompensatedForPreferringHigherLevels
Summary: Higher levels in the LSM ignore compensation in the lower levels. This ensures that L0 isn't affected too much by compensation in lower levels. This heuristic differs from NoCompensateL0 in that it prioritizes any higher level over a lower level, whereas NoCompensateL0 only prioritizes L0. See the PR for more details.
Commit: ce5f640
Grafana run.
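
One plausible reading of the summary above, sketched with hypothetical names (the actual rule is in the commit):

```go
package compactionsketch

// higherLevelBeats sketches CompensatedForPreferringHigherLevels: when
// comparing a higher LSM level against a lower one, the lower level's
// deletion compensation is ignored, so compensation alone can never pull a
// lower level ahead of a higher one.
func higherLevelBeats(higherCompensatedScore, lowerRawScore float64) bool {
	return higherCompensatedScore >= lowerRawScore
}
```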


Consistently works a little better than CapCompensated: the drop in throughput is consistently smaller, but not by much.

The dip is at ~50k SQL statements/sec.
Screenshot 2023-08-17 at 6 14 47 PM

@bananabrick
Contributor Author

bananabrick commented Aug 17, 2023

Master

Grafana run.


SQL statements/sec during the dip is ~25k, consistently lower than with the CapCompensated and CompensatedForPreferringHigherLevels heuristics.

Screenshot 2023-08-17 at 6 17 27 PM

@jbowens
Collaborator

jbowens commented Aug 18, 2023

Heuristic 1: CapCompensated

One possible variant of this: compute the raw score x, and compute C = total compensation / level target size. Today, the compensated score is x + C. Instead, compute the compensated score as x + ln(e+C) - 1. This makes the compensated score grow logarithmically with the amount of data the level would drop, and keeps it continuous, so we sidestep any concerns about equal scores introduced by a hard cap.
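
A small sketch of that formula (hypothetical names), showing how the adjustment grows with C:

```go
package main

import (
	"fmt"
	"math"
)

// logCompensated computes x + ln(e+C) - 1 as described above. At C=0 the
// adjustment is exactly 0, and it grows smoothly and logarithmically with C,
// so there is no hard cap that could produce tied scores.
func logCompensated(x, c float64) float64 {
	return x + math.Log(math.E+c) - 1
}

func main() {
	for _, c := range []float64{0, 1, 3, 10, 50} {
		fmt.Printf("C=%5.1f -> adjustment %.2f\n", c, logCompensated(0, c))
	}
}
```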

@sumeerbhola
Collaborator

Can you share the compaction_picker.go:1301 logs for the runs with heuristics 1, 3, and 4, starting from just before the dip and continuing a bit after things return to normal?

@bananabrick
Contributor Author

bananabrick commented Aug 27, 2023

Internal link for logs: https://drive.google.com/file/d/1kvEuCyz7DlZij9QrH-CFdqHvbI19S6E4/view?usp=sharing

The runs were for heuristics 1, 3, 4, and master. The cluster name is present in the logs and can be used to view the grafana dashboard for the run.

@sumeerbhola
Collaborator

I see directories of the form 1.cap_compensation_1, 1.compensate_higher, 1.master_1, 1.master_2, 1.raw_scores. I assume the 1 in the 1. prefix is the node name. What are the heuristics used in each of these?

@bananabrick
Contributor Author

cap_compensation is the same as #2832 (comment).

raw_scores is #2832 (comment).

compensate_higher is #2832 (comment).

master_2 is the latest cockroach master using this version of Pebble.

The first line in the Pebble logs contains the cluster names.

@sumeerbhola
Collaborator

cap_compensation_1
cluster=arjunnair-1692828628-01-n10cpu16

dip in sql throughput https://grafana.testeng.crdb.io/d/pZsi7NjVk/admission-control-copy?orgId=1&var-cluster=arjunnair-1692828628-01-n10cpu16&var-instances=All&from=1692832550747&to=1692832975547
Nodes 2 and 8 are the biggest contributors in terms of IO token exhaustion.

Screenshot 2023-08-30 at 3 08 00 PM

Looking at node 2's io_load_listener logs, which only log when there is overload, there are only 4 log entries. Note the low "admitting" rate near the end of each log statement, which is responsible for the queueing.

I230823 23:17:57.937708 777 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n2,s2] 527  IO overload: compaction score 0.900 (94 ssts, 18 sub-levels), L0 growth 51 MiB (write 50 MiB ingest 87 KiB ignored 18 KiB): requests 824657 (8898 bypassed) with 109 MiB acc-write (2.9 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.19x+1 B) + at-admission-tokens 57 B, compacted 36 MiB [≈39 MiB], flushed 2.2 GiB [≈2.3 GiB]; admitting 32 MiB (rate 2.1 MiB/s) due to L0 growth (used 56 MiB); write stalls 0

I230823 23:18:12.937977 777 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n2,s2] 745  IO overload: compaction score 0.850 (93 ssts, 17 sub-levels), L0 growth 54 MiB (write 54 MiB ingest 246 KiB ignored 41 KiB): requests 785153 (9577 bypassed) with 105 MiB acc-write (3.8 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.09x+1 B) + at-admission-tokens 64 B, compacted 65 MiB [≈52 MiB], flushed 1.6 GiB [≈2.0 GiB]; admitting 33 MiB (rate 2.2 MiB/s) due to L0 growth (used 33 MiB); write stalls 0

I230823 23:18:27.939120 777 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n2,s2] 1019  IO overload: compaction score 1.350[L0-overload] (157 ssts, 27 sub-levels), L0 growth 61 MiB (write 61 MiB ingest 275 KiB ignored 45 KiB): requests 824125 (10661 bypassed) with 110 MiB acc-write (3.8 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.54x+1 B (smoothed 0.52x+1 B) + ingested-model 0.00x+1 B (smoothed 0.05x+1 B) + at-admission-tokens 70 B, compacted 48 MiB [≈50 MiB], flushed 1.4 GiB [≈1.7 GiB]; admitting 29 MiB (rate 1.9 MiB/s) due to L0 growth (used 33 MiB); write stalls 0

I230823 23:18:42.939493 777 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n2,s2] 1286  IO overload: compaction score 2.150[L0-overload] (239 ssts, 43 sub-levels), L0 growth 75 MiB (write 66 MiB ingest 8.4 MiB ignored 8.1 MiB): requests 829800 (11830 bypassed) with 111 MiB acc-write (4.7 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.59x+1 B (smoothed 0.56x+1 B) + ingested-model 0.00x+1 B (smoothed 0.02x+1 B) + at-admission-tokens 77 B, compacted 58 MiB [≈54 MiB], flushed 1.0 GiB [≈1.4 GiB]; admitting 28 MiB (rate 1.9 MiB/s) due to L0 growth (used 29 MiB); write stalls 0

I230823 23:18:57.940203 777 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n2,s2] 1483  IO overload: compaction score 1.250[L0-overload] (159 ssts, 25 sub-levels), L0 growth 68 MiB (write 50 MiB ingest 17 MiB ignored 17 MiB): requests 598703 (9164 bypassed) with 82 MiB acc-write (3.7 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.61x+1 B (smoothed 0.58x+1 B) + ingested-model 0.00x+1 B (smoothed 0.01x+1 B) + at-admission-tokens 82 B, compacted 77 MiB [≈66 MiB], flushed 948 MiB [≈1.1 GiB]; admitting 30 MiB (rate 2.0 MiB/s) due to L0 growth (used 28 MiB); write stalls 0

The Pebble logs from the compaction_picker show a lower sub-level count because the picker compensates for ongoing compactions. For example, the sub-level count seen by the compaction_picker, 7, is lower than the 17 seen by the io_load_listener around the same time. We could consider changing what the io_load_listener sees, but there are bigger problems here (see below).

I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327  pickAuto: L4->L5
I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327 +‹   L0:   2.5    7.0    7.0    572MB      22MB     8.0EB  [L0->L2]›
I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327 +‹   L2:   3.7    3.0    2.8    211MB     165MB      64MB›
I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327 +‹   L3:   2.9    3.0    0.8     47GB     369MB     457MB›
I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327 +‹  *L4:   3.0    3.0    1.0     84GB     3.3GB     3.2GB›
I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327 +‹   L5:   3.0    2.9    1.0     66GB      22GB      23GB›
I230823 23:18:13.666790 2974652 3@pebble/compaction_picker.go:1316 ⋮ [n2,s2,pebble] 39327 +‹   L6:   1.0    1.0    0.9    157GB     154GB     162GB›
  • The cap of 3.0 is getting applied to 3 levels, and despite that, their score ratio is higher than that of L0. Given a compaction concurrency of 3, it is not surprising that they are getting preferred over L0. (I've picked one of the many such examples in the logs.)
  • Because multiple levels have a high score ratio, Lbase (L2 in this case) is building up a backlog of real bytes. Note that its real size is 165MB in the above log statement. This makes compactions from L0 => L2 take longer, which increases the difference between the sub-level count seen by the io_load_listener and the compaction_picker. An example of such a compaction is shown below.
cockroach-pebble.arjunnair-1692828628-01-n10cpu16-0002.ubuntu.2023-08-23T22_55_11Z.013393.log:I230823 23:18:20.203190 2974548 3@pebble/event.go:673 ⋮ [n2,s2,pebble] 39803  [JOB 6942] compacted(default) L0 [039284 039289 039285 039286 039287 039338 039343 039499 039497 039473 039483 039479 039475 039476 039478 039481 039480 039474 039477 039472 039464 039469 039465 039466 039467 039496 039495 039493 039494 039492 039484 039489 039485 039486 039487 039514 039511 039512 039513 039510 039501 039506 039502 039503 039504 039526 039529 039528 039527 039552 039551 039556 039555 039554 039553 039550 039576 039575 039572 039571 039573 039577 039574 039588] (14MB) + L2 [039320 039362 039377 039397 039398 039400 039401 039403 039411 039423 039438 039439 039450 039460 039462 039463 039482 039498 039509 039515 039530 039531 039533 039534 039536 039538 039548 039557 039560 039587 039594 039595 039596 039597 039599 039601] (165MB) -> L2 [039602 039637 039642 039663 039678 039680 039697 039712 039714 039718 039727 039746 039752 039781 039797 039798 039808 039814 039816 039818 039819 039821 039822 039824 039825 039827 039828 039830 039832 039839 039844 039857 039860 039862 039865 039867 039870] (176MB), in 7.3s (7.3s total), output rate 24MB/s

@bananabrick
Contributor Author

For this heuristic: #2832 (comment), there's a chance I'm not setting the raw scores properly for L0. I'm going to make a change and re-run.

@sumeerbhola
Collaborator

We also have a configuration problem. The following is a summary of compactions over 30s intervals (pebble db logs compactions cockroach-pebble.arjunnair-1692828628-01-n10cpu16-0002.ubuntu.2023-08-23T2* --window 30s). For some of the overload time, we are running with a compaction concurrency of 2.

node: 2, store: 2
   from: 230823 23:17
     to: 230823 23:17
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        14    88MB        4s
total                                                          14    88MB        4s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        22       0       0       0      22   539MB   532MB      0B      0B       18s
compact      L2      L3        18       0       0       0      18   242MB   239MB      0B      0B       10s
compact      L3      L4         4       8       0       0      12   236MB   236MB    69MB      0B        2s
compact      L6      L6         0       0       1       1       2    64MB    64MB      0B    13MB        0s
total                          44       8       1       1      54   1.1GB   1.0GB    69MB    13MB       31s

node: 2, store: 2
   from: 230823 23:17
     to: 230823 23:18
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        37    96MB        5s
total                                                          37    96MB        5s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         8       0       0       0       8    42MB    33MB      0B      0B        1s
compact      L0      L2        10       0       0       0      10   639MB   628MB      0B      0B       22s
compact      L2      L3         9       0       0       0       9   142MB   139MB      0B      0B        4s
compact      L3      L4         4       0       0       0       4   314MB   313MB      0B      0B        2s
compact      L4      L5        33       1       0       0      34   5.3GB   5.3GB   694KB      0B       29s
compact      L6      L6         0       0       1       0       1    64MB    64MB      0B      0B        0s
total                          64       1       1       0      66   6.5GB   6.4GB   694KB      0B      1m0s

node: 2, store: 2
   from: 230823 23:18
     to: 230823 23:18
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        87   116MB        6s
ingest               L0                                        11   6.5KB
ingest               L4                                         1   1.0KB
total                                                          99   116MB        6s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0        15       0       0       0      15   103MB    67MB      0B      0B        3s
compact      L0      L2         7       0       0       0       7   818MB   806MB      0B      0B       32s
compact      L2      L3         8       2       0       0      10    60MB    58MB    12MB      0B        3s
compact      L4      L5         5       1       0       0       6   683MB   681MB   1.0KB      0B        4s
compact      L5      L6        20       0       0       0      20   5.6GB   5.6GB      0B      0B       32s
total                          55       3       0       0      58   7.2GB   7.1GB    12MB      0B     1m17s

This is because Experimental.L0CompactionConcurrency = 10, which, coupled with the logic in

pebble/compaction_picker.go

Lines 1232 to 1238 in 1a45921

l0ReadAmp := p.vers.L0Sublevels.MaxDepthAfterOngoingCompactions()
compactionDebt := int(p.estimatedCompactionDebt(0))
ccSignal1 := n * p.opts.Experimental.L0CompactionConcurrency
ccSignal2 := n * p.opts.Experimental.CompactionDebtConcurrency
if l0ReadAmp < ccSignal1 && compactionDebt < ccSignal2 {
	return nil
}

means that if compaction_picker sees a sub-level count (accounting for ongoing compactions) of 7, it thinks a compaction concurrency of 1 is sufficient.

Since we now start shaping at a sub-level count of 5 for regular traffic and lower for elastic traffic, I propose we lower this setting to 2.
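
A rough worked example of that gate (hypothetical name, mirroring the quoted snippet): with n compactions already in progress, another score-based compaction is allowed only once the L0 read-amp or the compaction debt exceeds n times the configured per-compaction signal.

```go
package compactionsketch

// allowAnother mirrors the gate quoted above: with n compactions in progress,
// a further compaction is allowed only if L0 read-amp or compaction debt has
// outgrown n times the corresponding concurrency signal.
func allowAnother(n, l0ReadAmp, l0CompactionConcurrency,
	compactionDebt, compactionDebtConcurrency int) bool {
	return l0ReadAmp >= n*l0CompactionConcurrency ||
		compactionDebt >= n*compactionDebtConcurrency
}
```

With L0CompactionConcurrency = 10 and one compaction already running, a sub-level count of 7 never unlocks a second score-based compaction (ignoring debt); with a value of 2, the same 7 sub-levels keep unlocking additional compactions up to the configured maximum concurrency.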

@sumeerbhola
Collaborator

sumeerbhola commented Aug 30, 2023

For this heuristic: #2832 (comment), there's a chance I'm not setting the raw scores properly for L0. I'm going to make a change and re-run.

Yes, I figured something was wrong. For completeness:

raw_scores
cluster=arjunnair-1692820915-01-n10cpu16

node 10 is the worst 21:16:00 to 21:20:00
https://grafana.testeng.crdb.io/d/pZsi7NjVk/admission-control-copy?orgId=1&var-cluster=arjunnair-1692820915-01-n10cpu16&var-instances=All&from=1692825242930&to=1692825756535
Screenshot 2023-08-30 at 4 28 01 PM

nothing is getting compacted out of L0!

I230823 21:15:27.453540 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 506  IO overload: compaction score 0.900 (43 ssts, 18 sub-levels), L0 growth 13 MiB (write 13 MiB ingest 8.4 KiB ignored 9.4 KiB): requests 266036 (4824 bypassed) with 33 MiB acc-write (531 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.40x+1 B) + at-admission-tokens 51 B, compacted 421 KiB [≈7.1 MiB], flushed 2.2 GiB [≈2.4 GiB]; admitting 9.0 MiB (rate 615 KiB/s) due to L0 growth (used 17 MiB); write stalls 0

I230823 21:15:42.453842 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 625  IO overload: compaction score 2.300[L0-overload] (102 ssts, 46 sub-levels), L0 growth 12 MiB (write 12 MiB ingest 20 KiB ignored 19 KiB): requests 173607 (5366 bypassed) with 22 MiB acc-write (708 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.55x+1 B (smoothed 0.52x+1 B) + ingested-model 0.00x+1 B (smoothed 0.20x+1 B) + at-admission-tokens 61 B, compacted 0 B [≈3.5 MiB], flushed 1.4 GiB [≈1.9 GiB]; admitting 5.4 MiB (rate 368 KiB/s) due to L0 growth (used 9.2 MiB); write stalls 0

I230823 21:15:57.454261 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 737  IO overload: compaction score 4.900[L0-overload] (208 ssts, 98 sub-levels), L0 growth 11 MiB (write 11 MiB ingest 32 KiB ignored 28 KiB): requests 128272 (5139 bypassed) with 16 MiB acc-write (697 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.69x+1 B (smoothed 0.61x+1 B) + ingested-model 0.00x+1 B (smoothed 0.10x+1 B) + at-admission-tokens 76 B, compacted 0 B [≈1.8 MiB], flushed 452 MiB [≈1.2 GiB]; admitting 3.1 MiB (rate 214 KiB/s) due to L0 growth (used 5.4 MiB); write stalls 0

I230823 21:16:12.454498 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 862  IO overload: compaction score 7.500[L0-overload] (316 ssts, 150 sub-levels), L0 growth 12 MiB (write 12 MiB ingest 33 KiB ignored 28 KiB): requests 131059 (4889 bypassed) with 16 MiB acc-write (670 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.73x+1 B (smoothed 0.67x+1 B) + ingested-model 0.00x+1 B (smoothed 0.05x+1 B) + at-admission-tokens 86 B, compacted 0 B [≈907 KiB], flushed 380 MiB [≈780 MiB]; admitting 1.8 MiB (rate 122 KiB/s) due to L0 growth (used 3.1 MiB); write stalls 0

I230823 21:16:27.468918 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1044  IO overload: compaction score 9.900[L0-overload] (375 ssts, 198 sub-levels), L0 growth 1.7 MiB (write 1.7 MiB ingest 36 KiB ignored 27 KiB): requests 9328 (6154 bypassed) with 2.0 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.85x+1 B (smoothed 0.76x+1 B) + ingested-model 0.00x+1 B (smoothed 0.03x+1 B) + at-admission-tokens 137 B, compacted 0 B [≈454 KiB], flushed 117 MiB [≈449 MiB]; admitting 1.0 MiB (rate 69 KiB/s) due to L0 growth (used 1.8 MiB); write stalls 0

I230823 21:16:42.469383 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1182  IO overload: compaction score 12.650[L0-overload] (440 ssts, 253 sub-levels), L0 growth 1.3 MiB (write 1.3 MiB ingest 44 KiB ignored 29 KiB): requests 6094 (5751 bypassed) with 1.6 MiB acc-write (1.6 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.80x+1 B (smoothed 0.78x+1 B) + ingested-model 1.41x+1 B (smoothed 0.72x+1 B) + at-admission-tokens 181 B, compacted 0 B [≈227 KiB], flushed 76 MiB [≈262 MiB]; admitting 572 KiB (rate 38 KiB/s) due to L0 growth (used 1.2 MiB); write stalls 0

I230823 21:16:57.469894 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1304  IO overload: compaction score 15.750[L0-overload] (570 ssts, 315 sub-levels), L0 growth 9.9 MiB (write 9.8 MiB ingest 54 KiB ignored 32 KiB): requests 101539 (5517 bypassed) with 14 MiB acc-write (1.5 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.72x+1 B (smoothed 0.75x+1 B) + ingested-model 0.00x+1 B (smoothed 0.36x+1 B) + at-admission-tokens 141 B, compacted 0 B [≈113 KiB], flushed 249 MiB [≈256 MiB]; admitting 314 KiB (rate 21 KiB/s) due to L0 growth (used 1.2 MiB); write stalls 0

I230823 21:17:12.471097 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1406  IO overload: compaction score 18.450[L0-overload] (660 ssts, 369 sub-levels), L0 growth 1.3 MiB (write 1.2 MiB ingest 91 KiB ignored 30 KiB): requests 5676 (4977 bypassed) with 1.5 MiB acc-write (1.4 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.78x+1 B (smoothed 0.77x+1 B) + ingested-model 1.50x+9 B (smoothed 0.93x+5 B) + at-admission-tokens 183 B, compacted 0 B [≈57 KiB], flushed 59 MiB [≈158 MiB]; admitting 171 KiB (rate 11 KiB/s) due to L0 growth (used 1.1 MiB); write stalls 0

I230823 21:17:27.471582 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1462  IO overload: compaction score 19.550[L0-overload] (702 ssts, 391 sub-levels), L0 growth 662 KiB (write 593 KiB ingest 69 KiB ignored 12 KiB): requests 3858 (3729 bypassed) with 1.1 MiB acc-write (1.1 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.53x+1 B (smoothed 0.65x+1 B) + ingested-model 1.50x+13 B (smoothed 1.21x+9 B) + at-admission-tokens 177 B, compacted 0 B [≈28 KiB], flushed 112 MiB [≈158 MiB]; admitting 93 KiB (rate 6.2 KiB/s) due to L0 growth (used 870 KiB); write stalls 0

I230823 21:17:42.471959 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1466  IO overload: compaction score 19.600[L0-overload] (703 ssts, 392 sub-levels), L0 growth 147 KiB (write 147 KiB ingest 0 B ignored 0 B): requests 510 (492 bypassed) with 99 KiB acc-write (95 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 1.48x+1 B (smoothed 1.07x+1 B) + ingested-model 0.00x+0 B (smoothed 1.21x+4 B) + at-admission-tokens 235 B, compacted 0 B [≈14 KiB], flushed 494 MiB [≈158 MiB]; admitting 50 KiB (rate 3.3 KiB/s) due to L0 growth (used 69 KiB); write stalls 0

I230823 21:17:57.472330 1125 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n10,s10] 1471  IO overload: compaction score 19.600[L0-overload] (703 ssts, 392 sub-levels), L0 growth 0 B (write 0 B ingest 0 B ignored 0 B): requests 627 (619 bypassed) with 141 KiB acc-write (140 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.00x+0 B (smoothed 1.07x+1 B) + ingested-model 0.00x+0 B (smoothed 1.21x+2 B) + at-admission-tokens 235 B, compacted 0 B [≈7.1 KiB], flushed 0 B [≈158 MiB]; admitting 27 KiB (rate 1.8 KiB/s) due to L0 growth (used 153 KiB); write stalls 0

Despite a huge score

I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424  pickAuto: L5->L6
I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424 +‹   L0:  55.5   49.0   49.0    131GB      40MB     8.0EB›
I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424 +‹   L2:  14.4   13.1    0.9    838MB      56MB      64MB›
I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424 +‹   L3:   8.5    8.3    0.9    3.7GB     420MB     461MB›
I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424 +‹   L4:  13.3   13.0    1.0     42GB     3.2GB     3.2GB›
I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424 +‹  *L5:   1.3    1.2    1.0     29GB      23GB      23GB  [L5->L6 L5->L6]›
I230823 21:15:57.014917 8945271 3@pebble/compaction_picker.go:1329 ⋮ [n10,s10,pebble] 17424 +‹   L6:   1.0    1.0    1.0    169GB     161GB     169GB›

@sumeerbhola
Collaborator

And there are no huge compactions out of L0 blocking additional compactions either. There is just a gap of 4min 40s:

cockroach-pebble.arjunnair-1692820915-01-n10cpu16-0010.ubuntu.2023-08-23T20_53_11Z.012802.log:I230823 21:15:07.265249 8714966 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 15545  [JOB 3375] compacted(default) L0 [032280] (2.1MB) + L2 [032277 032281 032282] (16MB) -> L2 [032300 032301 032302 032303] (18MB), in 0.5s (0.5s total), output rate 35MB/s

cockroach-pebble.arjunnair-1692820915-01-n10cpu16-0010.ubuntu.2023-08-23T20_53_11Z.012802.log:I230823 21:15:12.667687 8751662 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 15599  [JOB 3384] compacted(default) L0 [032308] (421KB) + L2 [032284 032285 032286 032287 032288 032289 032290 032291] (16MB) -> L2 [032313 032314 032315 032316 032317 032318 032331 032332 032333] (17MB), in 0.5s (0.5s total), output rate 32MB/s
cockroach-pebble.arjunnair-1692820915-01-n10cpu16-0010.ubuntu.2023-08-23T20_53_11Z.012802.log:I230823 21:19:52.273584 9120344 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 25236  [JOB 5616] compacted(default) L0 [032311 032309 032312 032310 032328 032330 032329 032327 032324 032426 032427 032428 032424 032421 032450 032449 032448 032445 032473 032474 032472 032469 032486 032485 032484 032481 032499 032500 032498 032494 032512 032511 032510 032507 032538 032539 032540 032536 032533 032554 032553 032552 032549 032573 032574 032571 032568 032592 032593 032589 032580 032587 032615 032614 032606 032603 032619 032618 032617 032612 032631 032632 032630 032627 032650 032651 032648 032645 032664 032665 032663 032660 032679 032678 032677 032674 032705 032706 032704 032701 032716 032718 032715 032712 032751 032752 032750 032746 032742 032745 032766 032767 032765 032762 032784 032783 032782 032779 032802 032803 032801 032798 032816 032817 032814 032811 032828 032829 032827 032824 032842 032841 032840 032835 032853 032852 032851 032848 032844 032847 032874 032876 032872 032862 032879 032880 032878 032868 032891 032892 032890 032887 032902 032903 032901 032898 032923 032926 032921 032912 032930 032931 032929 032919 032944 032945 032942 032938 032958 032959 032957 032953 032973 032980 032972 032969 032979 033013 033014 033010 033001 033017 033016 033007 033064 033065 033063 033060 033085 033086 033084 033081 033098 033099 033097 033094 033123 033124 033122 033119 033134 033136 033133 033130 033149 033150 033148 033145 033160 033161 033159 033156 033172 033173 033171 033168 033194 033195 033190 033181 033198 033197 033196 033188 033214 033215 033213 033210 033221 033232 033233 033231 033228 033247 033248 033246 033243 033259 033260 033258 033255 033273 033272 033271 033268 033284 033285 033283 033280 033300 033301 033299 033296 033311 033312 033310 033307 033329 033330 033328 033325 033343 033344 033342 033339 033364 033365 033356 033353 033369 033368 033367 033362 033381 033380 033379 033375 033399 033400 033398 033395 033413 033414 033412 033409 033425 033426 033424 033421 033443 033444 033442 033438 033458 033459 033456 033453 033473 033474 033472 033469 033484 033485 033483 033480 033503 033505 033496 033492 033508 033509 033507 033502 033521 033520 033519 033516 033534 033536 033533 033530 033526 033527 033529 033547 033548 033546 033542 033559 033560 033558 033555 033571 033572 033569 033566 033590 033591 033589 033586 033610 033611 033609 033606 033624 033621 033649 033633 033629 033653 033645 033639 033681 033678 033700 033696 033717 033713 033727 033723 033737 033734 033755 033752 033766 033762 033775 033772 033790 033787 033802 033799 033814 033810 033832 033829 033825 033828 033850 033846 033842 033845 033863 033860 033876 033872 033886 033883 033897 033893 033909 033906 033932 033929 033942 033939 033958 033953 033969 033965 033982 033979 033993 033990 034005 034002 034019 034016 034033 034029 034025 034046 034043 034061 034058 034054 034057 034072 034069 034083 034079 034075 034078 034100 034097 034114 034111 034128 034124 034120 034121 034123 034140 034137 034149 034146 034166 034163 034178 034175 034188 034185 034206 034203 034218 034214 034230 034227 034242 034239 034252 034249 034266 034263 034284 034275 034288 034282 034301 034298 034313 034310 034332 034329 034345 034346 034344 034341 034357 034356 034355 034352 034370 034377 034369 034365 034381 034382 034380 034376 034395 034396 034394 034391 034387 034414 034413 034411 034402 034417 034418 034416 034410 034406 034407 034409 034439 034440 
034438 034434 034451 034452 034450 034447 034467 034468 034466 034462 034483 034484 034482 034478 034506 034507 034504 034501 034522 034523 034521 034518 034539 034540 034538 034534 034530 034533 034551 034552 034550 034547 034562 034563 034561 034558 034575 034576 034574 034571 034599 034597 034589 034586 034601 034602 034600 034595 034614 034615 034613 034610 034636 034638 034634 034625 034641 034642 034639 034633 034629 034632 034654 034655 034653 034650 034666 034667 034665 034661 034680 034681 034679 034676 034672 034675 034691 034692 034690 034687 034705 034704 034703 034700 034721 034722 034720 034716 034739 034740 034738 034735 034758 034759 034757 034754 034771 034768 034783 034780 034806 034803 034799 034800 034802 034818 034815 034829 034826 034822 034823 034825 034842 034838 034834 034835 034837 034856 034852 034848 034851 034876 034872 034868 034871 034890 034887 034899 034896 034910 034907 034903 034904 034906 034927 034918 034914 034917 034926 034922 034925 034939 034936 034949 034946 034959 034960 034958 034955 034972 034973 034971 034968 034991 034993 034989 034980 034995 034987 034983 034984 034986 035017 035014 035010 035013 035030 035027 035040 035037 035049 035046 035060 035057 035080 035071 035083 035077 035095 035091 035107 035104 035123 035120 035116 035117 035119 035137 035134 035148 035145 035141 035142 035144 035159 035156 035152 035155 035171 035168 035164 035165 035167 035185 035182 035178 035179 035195 035192 035216 035213 035209 035210 035226 035222 035218 035221 035242 035239 035235 035236 035238 035256 035253 035330 035672 035858 035854 035855 035856 035851 035938 035937 035933 035934 035932] (82MB) + L2 [032298 032299 032300 032301 032303] (21MB) -> L2 [036111 036115 036117 036118 036119 036120 036121 036122 036123 036124 036125 036126 036127 036128 036129] (72MB), in 3.0s (3.0s total), output rate 24MB/s

cockroach-pebble.arjunnair-1692820915-01-n10cpu16-0010.ubuntu.2023-08-23T20_53_11Z.012802.log:I230823 21:19:56.157052 9122846 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 26224  [JOB 5687] compacted(default) L0 [036161 036158] (518KB) + L2 [] (0B) -> L2 [036232] (372KB), in 0.0s (0.0s total), output rate 30MB/s

cockroach-pebble.arjunnair-1692820915-01-n10cpu16-0010.ubuntu.2023-08-23T20_53_11Z.012802.log:I230823 21:20:04.943553 9156942 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 32540  [JOB 5859] compacted(default) L0 [036255] (2.3MB) + L2 [036232] (372KB) -> L2 [036260] (2.5MB), in 0.2s (0.2s total), output rate 15MB/s

@sumeerbhola
Collaborator

compensate_higher
cluster=arjunnair-1692816704-01-n10cpu16
https://grafana.testeng.crdb.io/d/pZsi7NjVk/admission-control-copy?orgId=1&var-cluster=arjunnair-1692816704-01-n10cpu16&var-instances=All&from=1692820581272&to=1692820920589
Screenshot 2023-08-30 at 4 25 44 PM

Looking at 19:57:30 to 20:00:00 on node 8 (one of many nodes that show IO tokens exhaustion)

I230823 19:57:25.976145 1013 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n8,s8] 384  IO overload: compaction score 0.500 (19 ssts, 10 sub-levels), L0 growth 29 MiB (write 29 MiB ingest 14 KiB ignored 14 KiB): requests 497813 (4981 bypassed) with 66 MiB acc-write (4.4 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+0 B (smoothed 0.75x+1 B) + at-admission-tokens 52 B, compacted 26 MiB [≈26 MiB], flushed 2.3 GiB [≈2.6 GiB]; admitting 26 MiB (rate 1.7 MiB/s) due to L0 growth (used 34 MiB); write stalls 0

I230823 19:57:40.976364 1013 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n8,s8] 594  IO overload: compaction score 0.750 (54 ssts, 15 sub-levels), L0 growth 31 MiB (write 31 MiB ingest 40 KiB ignored 32 KiB): requests 479322 (4662 bypassed) with 64 MiB acc-write (5.1 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.38x+1 B) + at-admission-tokens 59 B, compacted 27 MiB [≈26 MiB], flushed 2.0 GiB [≈2.3 GiB]; admitting 23 MiB (rate 1.5 MiB/s) due to L0 growth (used 26 MiB); write stalls 0

I230823 19:58:10.976969 1013 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n8,s8] 1092  IO overload: compaction score 1.300[L0-overload] (66 ssts, 26 sub-levels), L0 growth 37 MiB (write 37 MiB ingest 92 KiB ignored 54 KiB): requests 450884 (4487 bypassed) with 61 MiB acc-write (5.2 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.59x+1 B (smoothed 0.55x+1 B) + ingested-model 0.00x+1 B (smoothed 0.09x+1 B) + at-admission-tokens 75 B, compacted 35 MiB [≈32 MiB], flushed 1.6 GiB [≈1.9 GiB]; admitting 24 MiB (rate 1.6 MiB/s) due to L0 growth (used 32 MiB); write stalls 0

I230823 19:58:25.977418 1013 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n8,s8] 1253  IO overload: compaction score 1.150[L0-overload] (64 ssts, 23 sub-levels), L0 growth 34 MiB (write 34 MiB ingest 74 KiB ignored 37 KiB): requests 475014 (4601 bypassed) with 64 MiB acc-write (5.3 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.52x+1 B (smoothed 0.53x+1 B) + ingested-model 0.00x+1 B (smoothed 0.05x+1 B) + at-admission-tokens 74 B, compacted 34 MiB [≈33 MiB], flushed 1.8 GiB [≈1.9 GiB]; admitting 20 MiB (rate 1.4 MiB/s) due to L0 growth (used 24 MiB); write stalls 0

I230823 20:02:40.981420 1013 util/admission/io_load_listener.go:544 ⋮ [T1,Vsystem,n8,s8] 1458  IO overload: compaction score 0.500 (46 ssts, 10 sub-levels), L0 growth 28 MiB (write 27 MiB ingest 98 KiB ignored 15 KiB): requests 473069 (3321 bypassed) with 64 MiB acc-write (5.1 MiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.00x+1 B) + at-admission-tokens 52 B, compacted 21 MiB [≈21 MiB], flushed 2.6 GiB [≈2.8 GiB]; admitting 21 MiB (rate 1.4 MiB/s) due to L0 growth (used 33 MiB); write stalls 0

L0 is consistently being picked, so the heuristic is "working"

I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121  pickAuto: L0->L2
I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121 +‹  *L0:   6.4   13.0   13.0    2.7GB     8.4MB     8.0EB  [L0->L2]›
I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121 +‹   L2:   2.0    2.0    2.0    130MB     130MB      64MB›
I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121 +‹   L3: 248.6  236.9    1.0    107GB     465MB     463MB›
I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121 +‹   L4:   1.0    1.0    1.0    3.3GB     3.1GB     3.3GB›
I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121 +‹   L5:   1.0    1.0    1.0     24GB      24GB      24GB›
I230823 19:58:20.625876 6826097 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29121 +‹   L6:   1.0    1.0    1.0    163GB     163GB     171GB›

The problem is that the other levels are being starved. See above that L2 now holds 130MB instead of its 64MB goal, which makes compactions out of L0 more expensive; for example, in the following compaction, 29MB of L2 is rewritten to compact just 593KB out of L0!

I230823 19:58:20.381104 6835749 3@pebble/event.go:673 ⋮ [n8,s8,pebble] 29062  [JOB 6327] compacted(default) L0 [038200 038217 038232 038247 038263 038280 038302 038307 038337 038342] (593KB) + L2 [038185 038189 038203 038205 038220] (29MB) -> L2 [038345 038348 038363 038365 038367 038368] (29MB), in 0.9s (0.9s total), output rate 32MB/s

The reason the other levels are being starved is that the compaction concurrency is not increasing:

node: 8, store: 8
   from: 230823 19:57
     to: 230823 19:57
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        28    54MB        2s
total                                                          28    54MB        2s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        39       2       0       0      41   600MB   594MB   3.6MB      0B       19s
compact      L2      L3        13       6       0       0      19   496MB   496MB    16MB      0B        5s
compact      L3      L4         2       1       0       0       3    18MB    18MB   3.2MB      0B        0s
compact      L4      L5         1       0       0       0       1    12MB    11MB      0B      0B        0s
compact      L5      L6         1       0       0       0       1    62MB    61MB      0B      0B        1s
total                          56       9       0       0      65   1.2GB   1.2GB    22MB      0B       28s

node: 8, store: 8
   from: 230823 19:57
     to: 230823 19:58
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        61    63MB        2s
total                                                          61    63MB        2s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         2       0       0       0       2   6.9MB   4.1MB      0B      0B        0s
compact      L0      L2        25       0       0       0      25   843MB   824MB      0B      0B       32s
compact      L6      L6         0       0       0       1       1      0B      0B      0B    16MB        0s
total                          27       0       0       1      28   850MB   828MB      0B    16MB       32s

node: 8, store: 8
   from: 230823 19:58
     to: 230823 19:58
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        84    71MB        3s
ingest               L0                                        14   2.0KB
ingest               L4                                         8    11KB
ingest               L5                                         2   2.7MB
total                                                         108    74MB        3s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        31       0       0       0      31   1.2GB   1.2GB      0B      0B       50s
compact      L2      L3         1       0       0       0       1    11MB    11MB      0B      0B        0s
total                          32       0       0       0      32   1.2GB   1.2GB      0B      0B       51s

node: 8, store: 8
   from: 230823 19:58
     to: 230823 19:59
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        18    33MB        1s
total                                                          18    33MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         5       0       0       0       5    15MB    12MB      0B      0B        0s
compact      L0      L2        23       0       0       0      23   480MB   476MB      0B      0B       13s
compact      L2      L3        24       2       0       0      26   463MB   460MB   6.2MB      0B        8s
compact      L3      L4        24       4       0       0      28   1.2GB   1.1GB    23MB      0B        6s
total                          76       6       0       0      82   2.1GB   2.1GB    29MB      0B       29s

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Aug 31, 2023
The default in Pebble, 10, delays increasing the compaction
concurrency which leads to the bad behavior discussed in
cockroachdb/pebble#2832 (comment).
The value of 10 was chosen when admission control was not shaping
incoming traffic until there were 20 sub-levels. Admission control now
shapes regular traffic starting at 5 sub-levels and elastic traffic
starting at 1 sub-level.

Informs cockroachdb#104862
Informs cockroachdb/pebble#2832

Epic: none

Release note: None
@sumeerbhola
Collaborator

I looked again at this compensate_higher run and node 8.

Here is the state of the LSM at the start of the increase in sub-levels followed by the state of the LSM near the end of the "overload":

I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385  pickAuto: L0->L2
I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385 +‹  *L0:   1.3    1.0    1.0     59GB     6.5MB     8.0EB›
I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385 +‹   L2:   0.7    0.7    0.7     48MB      48MB      64MB›
I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385 +‹   L3: 122.8  117.1    1.0     53GB     460MB     463MB›
I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385 +‹   L4:   1.0    1.0    1.0    3.3GB     3.1GB     3.3GB›
I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385 +‹   L5:   1.0    1.0    1.0     24GB      24GB      24GB›
I230823 19:57:17.331495 6603331 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 24385 +‹   L6:   1.0    1.0    1.0    163GB     163GB     171GB›


I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858  pickAuto: L0->L2
I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858 +‹  *L0:   5.0   11.0   11.0    1.3GB     8.0MB     8.0EB  [L0->L2]›
I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858 +‹   L2:   2.2    2.2    2.2    140MB     140MB      64MB›
I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858 +‹   L3: 248.6  236.9    1.0    107GB     465MB     463MB›
I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858 +‹   L4:   1.0    1.0    1.0    3.3GB     3.1GB     3.3GB›
I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858 +‹   L5:   1.0    1.0    1.0     24GB      24GB      24GB›
I230823 19:58:28.802973 6852456 3@pebble/compaction_picker.go:1324 ⋮ [n8,s8,pebble] 29858 +‹   L6:   1.0    1.0    1.0    163GB     163GB     171GB›

Note that the bytes in the LSM have not changed much, which is why only L0 and L2 have rawScore > 1.0, even though the only compactions happening are L0=>L2.

So if we ignore this scenario with compensated bytes (due to DELs and RANGEDELs), and consider something like this happening with normal SETs, we can have a situation where we start doing inefficient compactions out of L0, because compaction concurrency is still 1, and therefore only L0 can get picked. So #109780 would be helpful in general, and not just in these compensation scenarios.

But then the question is why so few bytes are being added to the LSM despite so many score-driven compactions from L0=>Lbase. The answer turns out to be that we are constantly flushing small memtables during this time interval, which is not the normal workload. And it is happening because of ingests that overlap with the memtable, e.g.:

cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:22.552893 6623765 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24602  [JOB 5234] flushed 6 ingested flushables L4:035936 (1.3KB) + L0:035941 (1.0KB) + L4:035937 (1.6KB) + L4:035938 (1.1KB) + L4:035939 (1.1KB) + L6:035940 (1.0KB) in 0.0s (0.0s total), output rate 5.8MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:22.983871 6625346 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24634  [JOB 5241] flushed 1 memtable (7.6MB) to L0 [035960 035969 035970 035971] (1.3MB), in 0.0s (0.0s total), output rate 29MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:22.987510 6625556 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24636  [JOB 5245] flushed 6 ingested flushables L4:035952 (1.3KB) + L0:035957 (1.0KB) + L4:035953 (1.6KB) + L4:035954 (1.1KB) + L4:035955 (1.1KB) + L6:035956 (1.0KB) in 0.0s (0.0s total), output rate 2.4MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:22.994257 6625421 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24642  [JOB 5247] flushed 1 memtable (370KB) to L0 [035972 035973 035974 035975] (104KB), in 0.0s (0.0s total), output rate 18MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:22.996537 6625592 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24644  [JOB 5248] flushed 6 ingested flushables L4:035961 (1.3KB) + L0:035966 (1.0KB) + L4:035962 (2.3KB) + L4:035963 (1.1KB) + L4:035964 (1.1KB) + L6:035965 (1.0KB) in 0.0s (0.0s total), output rate 5.1MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:23.477572 6627273 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24670  [JOB 5253] flushed 1 memtable (7.0MB) to L0 [035995 035996 035997 035998] (1.0MB), in 0.0s (0.0s total), output rate 30MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:23.479952 6627538 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24672  [JOB 5255] flushed 6 ingested flushables L4:035987 (1.4KB) + L0:035992 (1.0KB) + L4:035988 (1.6KB) + L4:035989 (1.1KB) + L4:035990 (1.1KB) + L4:035991 (1.0KB) in 0.0s (0.0s total), output rate 4.2MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:24.434239 6630891 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24742  [JOB 5264] flushed 1 memtable (14MB) to L0 [036015 036033 036034 036035] (1.6MB), in 0.1s (0.1s total), output rate 21MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:24.437292 6631148 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24744  [JOB 5268] flushed 6 ingested flushables L4:036007 (1.3KB) + L0:036012 (1.0KB) + L4:036008 (1.6KB) + L4:036009 (1.1KB) + L4:036010 (1.1KB) + L6:036011 (1.0KB) in 0.0s (0.0s total), output rate 4.3MB/s
cockroach-pebble.arjunnair-1692816704-01-n10cpu16-0008.ubuntu.2023-08-23T19_35_10Z.012476.log:I230823 19:57:24.445339 6631330 3@pebble/event.go:685 ⋮ [n8,s8,pebble] 24750  [JOB 5270] flushed 1 memtable (360KB) to L0 [036036 036037 036038 036039] (110KB), in 0.0s (0.0s total), output rate 15MB/s

Based on looking at the io_load_listener logs (from the previous comment), I suspect this experiment is not running with cockroachdb/cockroach#109332. Is that correct? I have a suspicion that with that PR we would not see a drop in SQL throughput -- we would still do inefficient compactions out of L0, but AC would not throttle.

Can we run with both cockroachdb/cockroach#109332 and cockroachdb/cockroach#109780, on master and see how much the throughput drops -- I wouldn't be surprised if the drop is much smaller.

craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Aug 31, 2023
109780: storage: set L0CompactionConcurrency to 2 r=bananabrick a=sumeerbhola

The default in Pebble, 10, delays increasing the compaction concurrency which leads to the bad behavior discussed in cockroachdb/pebble#2832 (comment). The value of 10 was chosen when admission control was not shaping incoming traffic until there were 20 sub-levels. Admission control now shapes regular traffic starting at 5 sub-levels and elastic traffic starting at 1 sub-level.

Informs #104862

Informs cockroachdb/pebble#2832

Epic: none

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
@bananabrick
Contributor Author

bananabrick commented Sep 14, 2023

CompensatedOnlyAsThreshold (raw scores)
cluster=arjunnair-1693452662-01-n10cpu16
grafana=https://grafana.testeng.crdb.io/d/ngnKs_j7x/admission-control?orgId=1&var-cluster=arjunnair-1693452662-01-n10cpu16&var-instances=All&from=1693452766251&to=1693460148228

We see the smallest impact to foreground throughput of all the runs, so this heuristic is succeeding at removing some of the effects of level compensation.

Looking at the logs to figure out why we still see a slight dip in foreground throughput, and whether we can eliminate it. Taking a look at node 10, which has the highest IO token exhaustion during the dip.

Screenshot 2023-09-14 at 3 27 34 AM

The compaction logs for a random 15s interval prior to the dip look like:

node: 10, store: 10
   from: 230831 04:20
     to: 230831 04:20
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                         5   320MB        1s
total                                                           5   320MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        16       1       0       0      17   178MB   175MB   1.6MB      0B        5s
compact      L2      L3         4       3       0       0       7    40MB    39MB   9.1MB      0B        1s
compact      L3      L4         3       2       0       0       5    43MB    42MB   903KB      0B        1s
compact      L4      L5         2       0       0       0       2    40MB    42MB      0B      0B        1s
compact      L5      L6         1       0       0       0       1   100MB    97MB      0B      0B        3s
total                          26       6       0       0      32   401MB   396MB    12MB      0B       12s

One thing I don’t understand is why the “bytes” column for the flush is 320MB but the “in(B)” column of the L0->L2 compaction is only 175MB. The “bytes” column of the flush should be the total bytes of the sstables written to disk during the flush, and the “in(B)” column for the L0->L2 compaction should be the total bytes read from L0 and L2 during the compaction. That would imply the L0 size is growing over time, which isn’t the case, so I must be misinterpreting those numbers. Going to ignore the discrepancy for now.

The IO token exhaustion occurs from ~4:34 to ~4:38.

Here’s the admission control logs during the throughput drop:

I230831 04:35:09.953378 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 386  IO overload: compaction score 0.050 (3 ssts, 1 sub-levels), L0 growth 31 MiB (write 31 MiB ingest 0 B ignored 0 B): requests 736903 (1945 bypassed) with 92 MiB acc-write (25 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+0 B (smoothed 1.13x+1 B) + at-admission-tokens 43 B, compacted 30 MiB [≈31 MiB], flushed 3.4 GiB [≈3.4 GiB]; admitting 5.0 GiB (rate 344 MiB/s) (elastic 38 MiB rate 2.6 MiB/s) due to memtable flush (multiplier 1.500) (used total: 47 MiB elastic 0 B); write stalls 0

I230831 04:35:24.953760 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 394  IO overload: compaction score 0.050 (5 ssts, 2 sub-levels), L0 growth 31 MiB (write 31 MiB ingest 1.0 KiB ignored 1.0 KiB): requests 682949 (2791 bypassed) with 86 MiB acc-write (132 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+0 B (smoothed 1.13x+1 B) + at-admission-tokens 45 B, compacted 30 MiB [≈30 MiB], flushed 3.0 GiB [≈3.2 GiB]; admitting 4.8 GiB (rate 326 MiB/s) (elastic 38 MiB rate 2.5 MiB/s) due to memtable flush (multiplier 1.500) (used total: 44 MiB elastic 0 B); write stalls 0

I230831 04:35:39.954178 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 521  IO overload: compaction score 0.300 (59 ssts, 18 sub-levels), L0 growth 33 MiB (write 33 MiB ingest 52 KiB ignored 15 KiB): requests 663282 (3119 bypassed) with 83 MiB acc-write (249 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.56x+1 B) + at-admission-tokens 48 B, compacted 28 MiB [≈29 MiB], flushed 2.6 GiB [≈2.9 GiB]; admitting 41 MiB (rate 2.7 MiB/s) (elastic 18 MiB rate 1.2 MiB/s) due to L0 growth (used total: 43 MiB elastic 0 B); write stalls 0

I230831 04:35:54.954369 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 774  IO overload: compaction score 0.300 (106 ssts, 19 sub-levels), L0 growth 42 MiB (write 41 MiB ingest 259 KiB ignored 42 KiB): requests 648447 (3310 bypassed) with 81 MiB acc-write (357 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.28x+1 B) + at-admission-tokens 57 B, compacted 41 MiB [≈35 MiB], flushed 2.4 GiB [≈2.6 GiB]; admitting 52 MiB (rate 3.5 MiB/s) (elastic 22 MiB rate 1.5 MiB/s) due to L0 growth (used total: 42 MiB elastic 0 B); write stalls 0

I230831 04:36:09.955995 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1060  IO overload: compaction score 0.450 (79 ssts, 29 sub-levels), L0 growth 46 MiB (write 46 MiB ingest 291 KiB ignored 70 KiB): requests 607590 (3581 bypassed) with 76 MiB acc-write (373 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.59x+1 B (smoothed 0.54x+1 B) + ingested-model 0.00x+1 B (smoothed 0.14x+1 B) + at-admission-tokens 67 B, compacted 46 MiB [≈40 MiB], flushed 1.9 GiB [≈2.3 GiB]; admitting 50 MiB (rate 3.3 MiB/s) (elastic 10 MiB rate 688 KiB/s) due to L0 growth (used total: 39 MiB elastic 0 B); write stalls 0

I230831 04:36:24.956321 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1198  IO overload: compaction score 0.650 (114 ssts, 41 sub-levels), L0 growth 42 MiB (write 42 MiB ingest 83 KiB ignored 46 KiB): requests 603004 (3203 bypassed) with 76 MiB acc-write (310 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.55x+1 B (smoothed 0.55x+1 B) + ingested-model 0.00x+1 B (smoothed 0.07x+1 B) + at-admission-tokens 70 B, compacted 35 MiB [≈38 MiB], flushed 2.1 GiB [≈2.2 GiB]; admitting 41 MiB (rate 2.7 MiB/s) (elastic 1 B rate 0 B/s) due to L0 growth (used total: 42 MiB elastic 0 B); write stalls 0

I230831 04:36:39.956600 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1279  IO overload: compaction score 0.150 (39 ssts, 9 sub-levels), L0 growth 31 MiB (write 31 MiB ingest 23 KiB ignored 14 KiB): requests 635424 (2530 bypassed) with 80 MiB acc-write (178 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.52x+1 B) + ingested-model 0.00x+1 B (smoothed 0.04x+1 B) + at-admission-tokens 60 B, compacted 38 MiB [≈38 MiB], flushed 2.8 GiB [≈2.5 GiB]; admitting 3.7 GiB (rate 254 MiB/s) (elastic 38 MiB rate 2.5 MiB/s) due to memtable flush (multiplier 1.500) (used total: 42 MiB elastic 0 B); write stalls 0

I230831 04:36:54.958905 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1381  IO overload: compaction score 0.150 (52 ssts, 10 sub-levels), L0 growth 32 MiB (write 32 MiB ingest 85 KiB ignored 31 KiB): requests 613758 (2689 bypassed) with 77 MiB acc-write (215 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.51x+1 B) + ingested-model 0.00x+1 B (smoothed 0.02x+1 B) + at-admission-tokens 57 B, compacted 33 MiB [≈35 MiB], flushed 2.4 GiB [≈2.4 GiB]; admitting 3.6 GiB (rate 249 MiB/s) (elastic 35 MiB rate 2.4 MiB/s) due to memtable flush (multiplier 1.500) (used total: 44 MiB elastic 0 B); write stalls 0

I230831 04:37:09.960235 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1477  IO overload: compaction score 0.400 (94 ssts, 25 sub-levels), L0 growth 37 MiB (write 37 MiB ingest 74 KiB ignored 25 KiB): requests 642663 (2332 bypassed) with 80 MiB acc-write (141 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.51x+1 B) + ingested-model 0.00x+1 B (smoothed 0.01x+1 B) + at-admission-tokens 58 B, compacted 28 MiB [≈32 MiB], flushed 2.4 GiB [≈2.4 GiB]; admitting 40 MiB (rate 2.7 MiB/s) (elastic 12 MiB rate 810 KiB/s) due to L0 growth (used total: 42 MiB elastic 0 B); write stalls 0

I230831 04:37:24.960539 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1506  IO overload: compaction score 0.250 (32 ssts, 5 sub-levels), L0 growth 29 MiB (write 29 MiB ingest 17 KiB ignored 7.3 KiB): requests 623847 (1968 bypassed) with 78 MiB acc-write (80 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.01x+1 B) + at-admission-tokens 53 B, compacted 25 MiB [≈28 MiB], flushed 3.1 GiB [≈2.7 GiB]; admitting 48 MiB (rate 3.2 MiB/s) (elastic 21 MiB rate 1.4 MiB/s) due to L0 growth (used total: 40 MiB elastic 0 B); write stalls 0

At 4:35:24 we're admitting 4.8 GiB per 15-second interval, but at 4:35:39 we admit only 41 MiB. The difference seems to be driven by the growth in sub-levels. While it's difficult to correlate the timestamps in the AC logs with the timestamps in the pebble compaction logs, around this time we see some ingestions into L0.
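To make the "admitting ... due to memtable flush" versus "due to L0 growth" switch in these logs easier to read, here is a highly simplified sketch of the two token-granting modes (the table of ingestions follows below). The function name, parameters, and multipliers are illustrative assumptions, not the actual io_load_listener code; the real computation is driven by the L0 overload score and several smoothed estimates.

```go
// admissionTokens sketches the two modes visible above: while L0 is healthy,
// the 15s byte budget roughly tracks smoothed flush throughput times a
// multiplier ("due to memtable flush (multiplier 1.500)"); once L0 is
// considered overloaded, the budget is throttled toward the rate at which
// compactions move data out of L0 ("due to L0 growth"). Names and constants
// are illustrative only.
func admissionTokens(l0Overloaded bool, smoothedFlushBytes, smoothedL0CompactedBytes int64) int64 {
	if !l0Overloaded {
		return int64(1.5 * float64(smoothedFlushBytes))
	}
	// Placeholder factor; the real code scales this with the L0 overload score.
	return int64(1.25 * float64(smoothedL0CompactedBytes))
}
```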

node: 10, store: 10
   from: 230831 04:35
     to: 230831 04:36
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        45   278MB        1s
ingest               L0                                        22    12KB
ingest               L5                                         1   1.3KB
ingest               L6                                         1   1.0KB
total                                                          69   278MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         3       0       0       0       3   7.3MB   4.8MB      0B      0B        0s
compact      L0      L2        16       0       0       0      16   554MB   538MB      0B      0B       20s
total                          19       0       0       0      19   561MB   543MB      0B      0B       20s

We see 22 tiny ingestions into L0 over a 15-second interval (4:35:45-4:36:00). I suspect this is contributing to the IO token exhaustion, as you indicated in your previous comments.

Also, it seems like we’re not using a compaction concurrency of 3, which you also indicated.

node: 10, store: 10
   from: 230831 04:35
     to: 230831 04:36
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        45   278MB        1s
ingest               L0                                        22    12KB
ingest               L5                                         1   1.3KB
ingest               L6                                         1   1.0KB
total                                                          69   278MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         3       0       0       0       3   7.3MB   4.8MB      0B      0B        0s
compact      L0      L2        16       0       0       0      16   554MB   538MB      0B      0B       20s
total                          19       0       0       0      19   561MB   543MB      0B      0B       20s

node: 10, store: 10
   from: 230831 04:36
     to: 230831 04:36
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        52   257MB        1s
total                                                          52   257MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        29       0       0       0      29   799MB   775MB      0B      0B       29s
compact      L3      L4         2       7       0       0       9    20MB   2.5MB    30MB      0B        0s
compact      L4      L4         0       0       0       1       1      0B      0B      0B    26GB        0s
compact      L4      L5        10      23       0       0      33    38MB    24MB   122MB      0B        0s
compact      L5      L5         0       0       0       1       1      0B      0B      0B   363KB        0s
compact      L6      L6         0       0       1       3       4    64MB    64MB      0B   192MB        0s
total                          41      30       1       5      77   921MB   866MB   152MB    26GB       30s

The compaction logs also indicate that we're flushing hundreds of MBs every 15 seconds, but the IO load listener logs indicate that the used/admitted tokens every 15 seconds are much less than 100MB. I don't understand this discrepancy either. Is this just incorrect write token estimates? @sumeerbhola

There are still some compactions like this one, which have tiny L0 files overlapping with large Lbase files:

I230831 04:13:28.857989 3760 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 368  [JOB 221] compacted(default) L0 [024331 024323 024328 024324 024327] (275KB) + L2 [024325 024326 024261 024262 024263] (17MB) -> L2 [024332 024333 024341 024344 024345] (17MB), in 0.0s (0.1s total), output rate 347MB/s

Note that the compaction occurs at 4:13, well before any level compensation kicks in, so I believe the current master will behave identically.

Here's an internal google drive link with the node 10 logs:
https://drive.google.com/file/d/1tf9Pu8KqAcJ7qlYYHeAsDAH2u3qBWwHA/view?usp=drive_link

@bananabrick
Contributor Author

Regarding the other comments:

1. Cap compensation
   (pick a heuristic to avoid the effects of large level size compensation #2832 (comment))

   I no longer think we should use a cap for compensation. The constant we use is arbitrary, and I've seen customer issues where non-L0 levels have higher scores than L0 due to compensation even though those scores aren't higher than 3, so a cap at 3 wouldn't have helped.

2. Compensate higher levels

   I no longer think we should use this heuristic either. I've lost trust in heuristics that apply asymmetrically across levels. In a way, part of the reason the problem in this issue exists is that the level compensation heuristic applies only to non-L0 levels.

3. Use raw score to prioritize levels

   This works the best in terms of minimizing throughput drops and is the simplest to understand. I've posted some analysis/logs in the previous comment. Please take a look.

@jbowens
Collaborator

jbowens commented Sep 14, 2023

> This means that L0 size should be growing over time. That isn't the case, so I must be misinterpreting those numbers.

Since this was a short 15s window, is it possible L0 size was growing just over the course of the 15s window but not at a longer time scale? If compactions take multiple seconds, it's not unreasonable to get a high variation like this with such a small time window.

@sumeerbhola
Collaborator

I suspect we are not running with cockroachdb/cockroach#109780, since for a substantial part of the overload we are running with only 1 compaction at a time out of L0 even though the compaction picker is seeing scores like the ones below (in this case MaxDepthAfterOngoingCompactions() is 8, so ccSignal1 := n * p.opts.Experimental.L0CompactionConcurrency should allow n=3, since 3*2 < 8). There is also the possibility that we are not able to find concurrent compactions, but first we should confirm that we are running with that PR (it merged the evening before this run, so it may have been missed).

I230831 04:35:43.095261 4876072 3@pebble/compaction_picker.go:1330 ⋮ [n10,s10,pebble] 30803 +‹  *L0:   7.5    8.0    8.0    187MB     3.8MB     8.0EB›
I230831 04:35:43.095261 4876072 3@pebble/compaction_picker.go:1330 ⋮ [n10,s10,pebble] 30803 +‹   L2:   2.1    2.2    1.1    139MB      69MB      64MB›
I230831 04:35:43.095261 4876072 3@pebble/compaction_picker.go:1330 ⋮ [n10,s10,pebble] 30803 +‹   L3: 413.6  414.9    1.0    187GB     468MB     462MB›
I230831 04:35:43.095261 4876072 3@pebble/compaction_picker.go:1330 ⋮ [n10,s10,pebble] 30803 +‹   L4:   1.1    1.1    1.0    3.6GB     3.3GB     3.3GB›
I230831 04:35:43.095261 4876072 3@pebble/compaction_picker.go:1330 ⋮ [n10,s10,pebble] 30803 +‹   L5:   1.1    1.0    1.0     24GB      24GB      24GB›
I230831 04:35:43.095261 4876072 3@pebble/compaction_picker.go:1330 ⋮ [n10,s10,pebble] 30803 +‹   L6:   1.0    1.0    1.0    162GB     162GB     170GB›
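For reference, here is a minimal sketch of how that depth signal translates into an allowed compaction count, assuming L0CompactionConcurrency is 2 and the configured maximum concurrency is 3; the function and parameter names are illustrative, not the actual compaction picker code.

```go
// allowedL0Compactions keeps admitting another concurrent compaction while
// n * L0CompactionConcurrency stays below the L0 depth remaining after
// ongoing compactions, capped at the configured maximum. With a depth of 8,
// L0CompactionConcurrency of 2, and a cap of 3, this yields n=3 (3*2 < 8),
// matching the example above. Illustrative sketch only.
func allowedL0Compactions(l0CompactionConcurrency, maxDepthAfterOngoingCompactions, maxConcurrency int) int {
	n := 1
	for n+1 <= maxConcurrency && (n+1)*l0CompactionConcurrency < maxDepthAfterOngoingCompactions {
		n++
	}
	return n
}
```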

@sumeerbhola
Collaborator

I don't see a discrepancy between the flush stats in:

node: 10, store: 10
   from: 230831 04:35
     to: 230831 04:35
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        28    42MB        1s
total                                                          28    42MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        16       0       0       0      16   450MB   444MB      0B      0B       14s
compact      L2      L3         8       1       0       0       9   264MB   264MB   2.2MB      0B        1s
compact      L3      L4         0       2       0       0       2      0B      0B    10MB      0B        0s
total                          24       3       0       0      27   714MB   707MB    12MB      0B       15s

node: 10, store: 10
   from: 230831 04:35
     to: 230831 04:36
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        45    44MB        1s
ingest               L0                                        22    12KB
ingest               L5                                         1   1.3KB
ingest               L6                                         1   1.0KB
total                                                          69    44MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         3       0       0       0       3   7.3MB   4.8MB      0B      0B        0s
compact      L0      L2        16       0       0       0      16   554MB   538MB      0B      0B       20s
total                          19       0       0       0      19   561MB   543MB      0B      0B       20s

node: 10, store: 10
   from: 230831 04:36
     to: 230831 04:36
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        52    44MB        1s
total                                                          52    44MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2        29       0       0       0      29   799MB   775MB      0B      0B       29s
compact      L3      L4         2       7       0       0       9    20MB   2.5MB    30MB      0B        0s
compact      L4      L4         0       0       0       1       1      0B      0B      0B    26GB        0s
compact      L4      L5        10      23       0       0      33    38MB    24MB   122MB      0B        0s
compact      L5      L5         0       0       0       1       1      0B      0B      0B   363KB        0s
compact      L6      L6         0       0       1       3       4    64MB    64MB      0B   192MB        0s
total                          41      30       1       5      77   921MB   866MB   152MB    26GB       30s

node: 10, store: 10
   from: 230831 04:36
     to: 230831 04:36
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                        31    39MB        1s
total                                                          31    39MB        1s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L0         6       0       0       0       6    11MB   6.6MB      0B      0B        0s
compact      L0      L2        16       0       0       0      16   684MB   673MB      0B      0B       25s
compact      L2      L3         3       0       0       0       3    55MB    54MB      0B      0B        1s
compact      L3      L4         1     142       0       0     143   7.4MB   6.7MB   171MB      0B        0s
compact      L4      L5        26       2       0       0      28   503MB   119MB    16MB      0B        2s
compact      L5      L5         0       0       0       1       1      0B      0B      0B   4.0MB        0s
compact      L6      L6         0       0       2       4       6    40MB   2.8MB      0B   163MB        1s

and

I230831 04:35:39.954178 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 521  IO overload: compaction score 0.300 (59 ssts, 18 sub-levels), L0 growth 33 MiB (write 33 MiB ingest 52 KiB ignored 15 KiB): requests 663282 (3119 bypassed) with 83 MiB acc-write (249 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.56x+1 B) + at-admission-tokens 48 B, compacted 28 MiB [≈29 MiB], flushed 2.6 GiB [≈2.9 GiB]; admitting 41 MiB (rate 2.7 MiB/s) (elastic 18 MiB rate 1.2 MiB/s) due to L0 growth (used total: 43 MiB elastic 0 B); write stalls 0

I230831 04:35:54.954369 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 774  IO overload: compaction score 0.300 (106 ssts, 19 sub-levels), L0 growth 42 MiB (write 41 MiB ingest 259 KiB ignored 42 KiB): requests 648447 (3310 bypassed) with 81 MiB acc-write (357 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.50x+1 B (smoothed 0.50x+1 B) + ingested-model 0.00x+1 B (smoothed 0.28x+1 B) + at-admission-tokens 57 B, compacted 41 MiB [≈35 MiB], flushed 2.4 GiB [≈2.6 GiB]; admitting 52 MiB (rate 3.5 MiB/s) (elastic 22 MiB rate 1.5 MiB/s) due to L0 growth (used total: 42 MiB elastic 0 B); write stalls 0

I230831 04:36:09.955995 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1060  IO overload: compaction score 0.450 (79 ssts, 29 sub-levels), L0 growth 46 MiB (write 46 MiB ingest 291 KiB ignored 70 KiB): requests 607590 (3581 bypassed) with 76 MiB acc-write (373 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.59x+1 B (smoothed 0.54x+1 B) + ingested-model 0.00x+1 B (smoothed 0.14x+1 B) + at-admission-tokens 67 B, compacted 46 MiB [≈40 MiB], flushed 1.9 GiB [≈2.3 GiB]; admitting 50 MiB (rate 3.3 MiB/s) (elastic 10 MiB rate 688 KiB/s) due to L0 growth (used total: 39 MiB elastic 0 B); write stalls 0

I230831 04:36:24.956321 800 util/admission/io_load_listener.go:593 ⋮ [T1,Vsystem,n10,s10] 1198  IO overload: compaction score 0.650 (114 ssts, 41 sub-levels), L0 growth 42 MiB (write 42 MiB ingest 83 KiB ignored 46 KiB): requests 603004 (3203 bypassed) with 76 MiB acc-write (310 KiB bypassed) + 0 B acc-ingest (0 B bypassed) + write-model 0.55x+1 B (smoothed 0.55x+1 B) + ingested-model 0.00x+1 B (smoothed 0.07x+1 B) + at-admission-tokens 70 B, compacted 35 MiB [≈38 MiB], flushed 2.1 GiB [≈2.2 GiB]; admitting 41 MiB (rate 2.7 MiB/s) (elastic 1 B rate 0 B/s) due to L0 growth (used total: 42 MiB elastic 0 B); write stalls 0

The latter is slightly larger than the flushed bytes because there are ingests into L0 (which the compactions tool is not picking up due to a bug).

@sumeerbhola
Collaborator

As you noted, we have very inefficient compactions. This is especially pronounced during the overload, e.g.

cockroach-pebble.arjunnair-1693452662-01-n10cpu16-0010.ubuntu.2023-08-31T04_13_24Z.012490.log:I230831 04:35:49.307046 4894736 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 31405  [JOB 6478] compacted(default) L0 [034451 034463 034481 034497 034513 034533] (836KB) + L2 [034499 034500 034501 034510 034519 034520 034521 034534 034535 034536] (38MB) -> L2 [034590 034595 034596 034618 034627 034628 034629 034645 034647 034662] (38MB), in 1.2s (1.2s total), output rate 33MB/s
cockroach-pebble.arjunnair-1693452662-01-n10cpu16-0010.ubuntu.2023-08-31T04_13_24Z.012490.log:I230831 04:35:49.333837 4893095 3@pebble/event.go:673 ⋮ [n10,s10,pebble] 31422  [JOB 6451] compacted(default) L0 [034435 034427 034432 034428 034429 034430 034450 034449 034441 034446 034442 034443 034444 034462 034461 034453 034458 034454 034455 034456 034457 034479 034478 034470 034475 034471 034472 034473 034496 034495 034486 034491 034487 034488 034489 034512 034511 034502 034507 034503 034504 034505 034531 034530] (7.4MB) + L2 [034440 034465 034526 034466 034467 034468 034469 034480 034483 034484 034485 034492] (37MB) -> L2 [034537 034591 034592 034593 034594 034597 034611 034626 034630 034644 034646 034661] (41MB), in 1.8s (1.8s total), output rate 23MB/s

I don't think we should do anything more about this inside Pebble -- the problem is tiny ingests, and we should fix cockroachdb/cockroach#109808

@bananabrick
Contributor Author

The discrepancy doesn't occur during the dip in throughput and IO token exhaustion, but before it, and it lasts for a while.

See the logs over 10-minute intervals from 4:10 to 4:30.


node: 10, store: 10
   from: 230831 04:10
     to: 230831 04:20
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                       123   6.7GB       30s
ingest               L0                                       407    22MB
ingest               L2                                         5    74MB
ingest               L3                                         7    95MB
ingest               L4                                        66   1.0GB
ingest               L5                                       313   4.8GB
ingest               L6                                        12   185MB
total                                                         933    13GB       30s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2       405      75       0       0     480   3.9GB   3.9GB   124MB      0B     1m57s
compact      L2      L3        68     145       0       0     213   651MB   642MB   437MB      0B       17s
compact      L3      L4        31     152       0       0     183   390MB   387MB   683MB      0B        9s
compact      L4      L5        85     124       0       0     209   1.7GB   1.7GB   955MB      0B       27s
compact      L5      L6       298       6       0       0     304    18GB    17GB    40MB      0B      2m3s
compact      L6      L6         0       0      52       1      53   2.1GB   2.1GB      0B   4.4MB        8s
total                         887     502      52       1    1442    26GB    26GB   2.2GB   4.4MB      5m4s

node: 10, store: 10
   from: 230831 04:20
     to: 230831 04:30
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                       198    12GB       55s
ingest               L0                                        21    36MB
ingest               L2                                         2    92MB
ingest               L4                                         2   5.0KB
ingest               L5                                        10    18KB
ingest               L6                                         1    50MB
total                                                         234    12GB       55s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2       674     111       0       0     785   9.2GB   9.2GB   190MB      0B     4m40s
compact      L2      L3       224     125       0       0     349   2.4GB   2.4GB   282MB      0B     1m15s
compact      L3      L4        84     134       0       0     218   1.6GB   1.6GB   596MB      0B       33s
compact      L4      L5        82      74       0       0     156   1.8GB   1.8GB   490MB      0B       35s
compact      L5      L5         0       0       0       4       4      0B      0B      0B   222MB        0s
compact      L5      L6        88       4       0       0      92   5.2GB   4.9GB    63MB      0B     2m36s
compact      L6      L6         0       0       7       3      10   159MB   148MB      0B    61MB        0s
total                        1152     448       7       7    1614    20GB    20GB   1.6GB   283MB     9m42s

@sumeerbhola
Collaborator

Running `pebble db logs compactions cockroach-pebble.arjunnair-1693452662-01-n10cpu16-0010.ubuntu.2023-08-31T0*`, I don't see this:

node: 10, store: 10
   from: 230831 04:10
     to: 230831 04:20
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                       123   705MB       30s
ingest               L0                                       407    22MB
ingest               L2                                         5    74MB
ingest               L3                                         7    95MB
ingest               L4                                        66   1.0GB
ingest               L5                                       313   4.8GB
ingest               L6                                        12   185MB
total                                                         933   6.9GB       30s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2       405      75       0       0     480   3.9GB   3.9GB   124MB      0B     1m57s
compact      L2      L3        68     145       0       0     213   651MB   642MB   437MB      0B       17s
compact      L3      L4        31     152       0       0     183   390MB   387MB   683MB      0B        9s
compact      L4      L5        85     124       0       0     209   1.7GB   1.7GB   955MB      0B       27s
compact      L5      L6       298       6       0       0     304    18GB    17GB    40MB      0B      2m3s
compact      L6      L6         0       0      52       1      53   2.1GB   2.1GB      0B   4.4MB        8s
total                         887     502      52       1    1442    26GB    26GB   2.2GB   4.4MB      5m4s

node: 10, store: 10
   from: 230831 04:20
     to: 230831 04:30
  r-amp: NaN
_kind______from______to_____________________________________count___bytes______time
flush                L0                                       198   1.2GB       55s
ingest               L0                                        21    36MB
ingest               L2                                         2    92MB
ingest               L4                                         2   5.0KB
ingest               L5                                        10    18KB
ingest               L6                                         1    50MB
total                                                         234   1.4GB       55s
_kind______from______to___default____move___elide__delete___count___in(B)__out(B)__mov(B)__del(B)______time
compact      L0      L2       674     111       0       0     785   9.2GB   9.2GB   190MB      0B     4m40s
compact      L2      L3       224     125       0       0     349   2.4GB   2.4GB   282MB      0B     1m15s
compact      L3      L4        84     134       0       0     218   1.6GB   1.6GB   596MB      0B       33s
compact      L4      L5        82      74       0       0     156   1.8GB   1.8GB   490MB      0B       35s
compact      L5      L5         0       0       0       4       4      0B      0B      0B   222MB        0s
compact      L5      L6        88       4       0       0      92   5.2GB   4.9GB    63MB      0B     2m36s
compact      L6      L6         0       0       7       3      10   159MB   148MB      0B    61MB        0s
total                        1152     448       7       7    1614    20GB    20GB   1.6GB   283MB     9m42s

@bananabrick
Contributor Author

bananabrick commented Sep 17, 2023

@sumeerbhola

Looks like there is still a dip when using the raw scores heuristic on top of the latest pebble master and running with the latest cockroach master, but the dip is tiny and doesn't last long.

See, https://grafana.testeng.crdb.io/d/ngnKs_j7x/admission-control?orgId=1&var-cluster=arjunnair-1694983689-01-n10cpu16&var-instances=All&from=1694983459572&to=1694992796347.

Most of the IO token exhaustion happens on node 5 from 21:52 to 21:53:15:

Screenshot 2023-09-17 at 7 28 35 PM

The L0 sub-level count increases and then goes back to 0 over the period from 21:50:30 to 21:52:45.
Screenshot 2023-09-17 at 7 43 53 PM

Here are the 15-second logs from that period.

It seems like we are picking compactions almost exclusively out of L0, and we also have high compaction concurrency. The problem is that the compactions picked are very expensive, which seems to correlate with the increased ingestions into L0.

I think it's worthwhile to fix this, but I don't think it should block this issue, and I don't think the dip is happening because of this heuristic. That said, it's possible that normally (on master) we would have picked compactions out of the non-L0 levels because of compensation, which would have prevented the expensive L0 -> Lbase compactions.

> Running `pebble db logs compactions cockroach-pebble.arjunnair-1693452662-01-n10cpu16-0010.ubuntu.2023-08-31T0*`, I don't see this:

It was probably due to a different version of the pebble log tool being used over the same logs. I tried running the latest pebble build + #2906 on the logs from the latest run, and I don't see this discrepancy.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 20, 2023
Small snapshots cause LSM overload by resulting in many tiny memtable
flushes, which result in high sub-level count, which then needs to be
compensated by running many inefficient compactions from L0 to Lbase.
Despite some compaction scoring changes, we have not been able to
fully eliminate impact of this in foreground traffic as discussed in
cockroachdb/pebble#2832 (comment).

Fixes cockroachdb#109808

Epic: none

Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below
which snapshots are converted to regular writes. It defaults to 100KiB.
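A minimal sketch of the idea in this commit message: kv.snapshot.ingest_as_write_threshold is the real cluster setting, but the function, parameters, and callbacks below are illustrative assumptions rather than the actual kvserver code.

```go
// ingestAsWriteThreshold mirrors the default of the cluster setting
// kv.snapshot.ingest_as_write_threshold (100 KiB).
const ingestAsWriteThreshold = 100 << 10

// applySnapshot applies small snapshots as ordinary batch writes (through the
// memtable) instead of ingesting them as sstables, avoiding many tiny L0
// ingests and the resulting sub-level growth. Illustrative sketch only.
func applySnapshot(snapshotBytes int64, ingestSSTs func() error, applyAsBatch func() error) error {
	if snapshotBytes < ingestAsWriteThreshold {
		return applyAsBatch()
	}
	return ingestSSTs()
}
```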
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Sep 24, 2023
Small snapshots cause LSM overload by resulting in many tiny memtable
flushes, which result in high sub-level count, which then needs to be
compensated by running many inefficient compactions from L0 to Lbase.
Despite some compaction scoring changes, we have not been able to
fully eliminate impact of this in foreground traffic as discussed in
cockroachdb/pebble#2832 (comment).

Fixes cockroachdb#109808

Epic: none

Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below
which snapshots are converted to regular writes. It defaults to 100KiB.
bananabrick added a commit that referenced this issue Sep 25, 2023
During large table drops, we see level compensation become drastically
high, which leads to L0 compaction starvation, which in turn leads to
AC queueing of requests and increases in foreground latencies.

We now completely ignore compensation when prioritizing levels for
compaction, but we still consider compensation when deciding whether a
given level warrants a compaction at all.

The above approach ensures that L0 isn't starved during large table
drops.

Note that there might be scenarios where a level, say L3, has a high
compensated score due to a wide range delete and would also drop a ton
of data in L4, but we end up picking L4 for compaction because it has
a higher raw score.

While this makes some of the picked compactions inefficient, most of
the data dropped by the range delete should be in L6 anyway, and the
L5 -> L6 compaction that eventually carries this range delete will
still be cheap.

We considered some other heuristics in
#2832.

Other approaches considered include capping the level compensation at
a constant, or accounting for compensation in higher levels but not
lower levels. The first approach might not work well, because the
level compensation can be lower than the constant picked and still
starve out L0. The second approach is more difficult to reason about
compared to the current approach.
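For readers skimming the thread, here is a minimal sketch of the heuristic this commit message describes: rank levels purely by raw (uncompensated) score, but only pick a level whose compensated score still indicates a compaction is warranted. The package, types, and function below are illustrative, not the actual compaction picker code.

```go
package sketch

import "sort"

// levelScores carries the two scores the heuristic distinguishes: the raw
// score (level size relative to its target size, uncompensated) and the
// compensated score (inflated by an estimate of bytes the compaction would
// drop, e.g. due to range deletes).
type levelScores struct {
	level            int
	rawScore         float64
	compensatedScore float64
}

// pickCompactionLevel ranks candidate levels by raw score so that a heavily
// compensated level can no longer starve L0, but still uses the compensated
// score as the threshold for whether the level needs compacting at all.
func pickCompactionLevel(levels []levelScores) (levelScores, bool) {
	sort.Slice(levels, func(i, j int) bool {
		return levels[i].rawScore > levels[j].rawScore
	})
	for _, l := range levels {
		if l.compensatedScore > 1 {
			return l, true
		}
	}
	return levelScores{}, false
}
```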
@bananabrick
Contributor Author

Done in #2917

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Oct 3, 2023
Small snapshots cause LSM overload by resulting in many tiny memtable
flushes, which result in high sub-level count, which then needs to be
compensated by running many inefficient compactions from L0 to Lbase.
Despite some compaction scoring changes, we have not been able to
fully eliminate impact of this in foreground traffic as discussed in
cockroachdb/pebble#2832 (comment).

Fixes cockroachdb#109808

Epic: none

Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below
which snapshots are converted to regular writes. It defaults to 100KiB.
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Oct 3, 2023
110943: kvserver,storage: ingest small snapshot as writes r=itsbilal,erikgrinaker a=sumeerbhola

Small snapshots cause LSM overload by resulting in many tiny memtable flushes, which result in high sub-level count, which then needs to be compensated by running many inefficient compactions from L0 to Lbase. Despite some compaction scoring changes, we have not been able to fully eliminate impact of this in foreground traffic as discussed in cockroachdb/pebble#2832 (comment).

Fixes #109808

Epic: none

Release note (ops change): The cluster setting
kv.snapshot.ingest_as_write_threshold controls the size threshold below which snapshots are converted to regular writes. It defaults to 100KiB.

111627: encoding: fix UnsafeConvertStringToBytes to work with large input strings r=ecwall a=ecwall

Fixes #111626

The previous impl assumed input string length <= math.MaxInt32. Go 1.20 added unsafe.StringData (https://pkg.go.dev/unsafe#StringData) which properly handles longer strings. This changes the impl to use unsafe.StringData and adds a unit test.
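For context, a minimal sketch of the Go 1.20 approach described above; this is illustrative and may differ in detail from the actual CockroachDB implementation.

```go
import "unsafe"

// UnsafeConvertStringToBytes returns a byte slice aliasing the string's
// backing memory. unsafe.StringData (Go 1.20+) works for any string length,
// unlike the previous approach that assumed len(s) <= math.MaxInt32. The
// caller must not mutate the returned slice.
func UnsafeConvertStringToBytes(s string) []byte {
	if len(s) == 0 {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}
```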

Release note (bug fix): Fixed a panic that could occur if a query uses a string
larger than 2^31-1 bytes.

111656: cluster-ui: pin `pnpm` to `8.6.10` for cluster-ui-release workflow r=THardy98 a=THardy98

Epic: none
This change pins `pnpm` to `8.6.10` for the cluster-ui release (and release-next) workflow(s) to prevent not up-to-date lockfiles when installing cluster-ui dependencies with pnpm.

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024