roachperf: high core-count regression around February 16th #76738

ajwerner · 2022-02-17T15:58:07Z

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:

https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=aws

https://roachperf.crdb.dev/?filter=&view=kv95%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D32%2Fseq&tab=aws

The same day we see marked improvements in the lower core-count workloads. Presumably this is all due to #76350, but we should bisect and profile to understand.

cc @Azhng 😓

Jira issue: CRDB-13255

ajwerner · 2022-02-17T16:02:33Z

Perhaps this was my advice on how many shards to use.

Azhng · 2022-02-17T16:39:08Z

Hmm, is there a wiki on how to run the bisection?

Azhng · 2022-02-17T17:11:02Z

Hmm, the new fifoCache is for sure faster than the cache.UnorderedCache. I have a feeling that this might be because we dropped the batch size from 1024 to 168. I'll play around with this when I get time.

tbg · 2022-02-18T07:30:35Z

Can we treat this with some urgency? A 20% regression has ripple effects across the org. For example, @nvanbenschoten was trying to validate a customer workload yesterday (on large machines), and this regression significantly shifts the baseline.

Azhng · 2022-02-18T14:11:45Z

Hi @tbg, this issue is high on my to-do list. Meanwhile, if you have urgent validation to do, you can disable the offending component via:
SET CLUSTER SETTING sql.contention.txn_id_cache.max_size = 0;

Azhng · 2022-02-19T00:43:22Z

Spent the day investigating this. Interesting observation: I think that on high core-count machines, the writers produce writes a lot faster than the single background goroutine can consume them. Once the channel fills up and the goroutine cannot keep up, it produces back pressure, which slows down the writers.

I did a few runs with different configurations. Here are the results I observed:

(note: 16-128-168 here means 16 writer shards, a channel of size 128, and a batch size of 168)

configuration	relative performance
disabled-via-cluster-setting	100.00%
enabled-cache-fifo-16-128-168 (master as of Feb 16th)	84.35%
enabled-cache-UnorderedCache-16-128-1024 (master before Feb 16th)	87.78%
tuned-64-512-2048	91.85%
tuned-64-512-2048-3-goroutines	102.39% (I assume +/- 2% is within margin of uncertainty?)

I haven't tested different combinations of goroutine count and buffer size, so this might be complete overkill, but it seems to have eliminated the perf drop seen on 32-core machines.

@ajwerner thoughts on pursuing this path to solve the perf issue here?

Raw benchmark data.

Branch	elapsed	ops(total)	ops/sec(cum)	avg(ms)	p50(ms)	p95(ms)	p99(ms)	pMax(ms)
disabled-baseline	2700.0s	200126759	74121.0	0.9	0.7	2.0	4.5	41.9
enabled-fifo-cache-16-128-168	2700.0s	168811009	62522.6	1.0	0.7	2.4	9.4	125.8
enabled-unordered-cache-16-128-1024	2700.0s	175668637	65062.4	1.0	0.7	2.5	5.0	151.0
64-512-2048	2700.0s	183813891	68079.2	0.9	0.7	2.4	6.6	79.7
64-512-2048-3-goroutines	2700.0s	204900223	75889.0	0.8	0.7	1.9	3.7	37.7

erikgrinaker · 2022-02-19T17:43:08Z

Should we consider disabling this setting by default until we get to the bottom of it?

erikgrinaker · 2022-02-24T09:20:31Z

@Azhng Thoughts on disabling this? It can mask other performance issues, and as we're moving into stability we'll need to start addressing them across the board.

tbg · 2022-02-24T14:57:51Z

The regression should now be "fixed" by defaulting the cluster setting to zero (#76973 (comment)). We'll need to add an annotation to roachperf.

Azhng · 2022-03-01T04:19:14Z

The heap profile here is pointing the finger at the eviction list in the FIFO store. Somehow that's creating a lot of objects on the heap. Hmm, I was under the impression that using sync.Pool would reduce those allocations?

This is with the capacity limit set to 64MB.

Hmm, though this doesn't quite explain how running 3 copies of txnIDCache in 3 different goroutines improved the situation. 🤔

Profile:

pprof.cockroach.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz
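As background for the sync.Pool question, here is a minimal Go sketch of pooling fixed-size blocks. The `block` type, `getBlock`, and `putBlock` are hypothetical stand-ins, not the real txnidcache types. The pool can only save an allocation for a block that has been handed back with `Put`; a block that is still referenced elsewhere (for example, from an eviction list) stays live on the heap and shows up in an in-use profile regardless of the pool.

```go
package main

import (
	"fmt"
	"sync"
)

const blockSize = 168 // batch size used on master as of Feb 16th, per the runs above

// block is a hypothetical fixed-size batch of entries.
type block struct {
	entries [blockSize]uint64
	n       int // number of valid entries
}

var blockPool = sync.Pool{
	// New is called only when the pool has no block available for reuse.
	New: func() interface{} { return &block{} },
}

// getBlock returns an empty block, reusing a pooled one when available.
func getBlock() *block {
	b := blockPool.Get().(*block)
	b.n = 0 // reset length; stale entries beyond n are ignored
	return b
}

// putBlock hands a block back to the pool. The caller must not use it afterwards.
func putBlock(b *block) { blockPool.Put(b) }

func main() {
	b := getBlock()
	b.entries[b.n] = 42
	b.n++
	fmt.Println("entries in block:", b.n)
	putBlock(b)

	// Whether or not the same block comes back, it is handed out empty.
	fmt.Println("entries after reuse:", getBlock().n)
}
```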

ajwerner · 2022-03-01T14:10:43Z

Those 208B blocks under fifoCache.add scare me. I suspect the big block is the map itself. The rest is the sync.Pool.

ajwerner · 2022-03-01T14:16:24Z

It seems to me like you're not accounting for the memory properly here. Namely, you need to account for the blocks themselves, which are in use.

This change does two things to the txnidcache:

1) It accounts for the space used by the fifo eviction list. Previously we'd use more than double the intended space. We should probably also subtract out the size of the buffers we're currently filling and the channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up being a good deal of overhead because we have to copy across every single message. Instead we can just append the block directly to the list. This does have the hazard of wasting a lot of space when the blocks are sparse. However, if the blocks are sparse, we know that the throughput is low, so it's fine.

This is DNM because the tests need to change.

Touches cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

77208: sql: update test that was fooling itself r=ajwerner a=ajwerner

I have no clue what is going on in #76843 but this test was fooling itself regarding the existence of separate connections.

Release justification: non-production code changes
Release note: None

77220: sql/contention/txnidcache: reuse blocks in list, account for space r=maryliag,ajwerner a=ajwerner

This change does two things to the txnidcache:

1) It accounts for the space used by the fifo eviction list. Previously we'd use more than double the intended space. We should probably also subtract out the size of the buffers we're currently filling and the channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up being a good deal of overhead because we have to copy across every single message. Instead we can just append the block directly to the list. This does have the hazard of wasting a lot of space when the blocks are sparse. However, if the blocks are sparse, we know that the throughput is low, so it's fine.

Resolves #76738

Release justification: bug fixes and low-risk updates to new functionality
Release note: None

77363: sql/delegate: avoid extra string->int parsing r=otan a=rafiss

Release justification: low risk improvement
Release note: None

77438: ui: Remove stray parenthesis in Jobs page r=jocrl a=jocrl

Addresses #77440. This commit fixes the stray parenthesis at the end of the duration time for a succeeded job. The parenthesis had been introduced in #76691 and the 21.2 backport #73624.

Before: ![image](https://user-images.githubusercontent.com/91907326/157065776-456c8f7d-1958-4192-b38d-dcb40432cf9d.png)
After: ![image](https://user-images.githubusercontent.com/91907326/157065785-e3f2db6a-67d1-4ae3-87cb-df71dccf0e5f.png)

Release note (ui): Remove stray parenthesis at the end of the duration time for a succeeded job. It had been accidentally introduced to unreleased master and a 21.2 backport.
Release justification: Category 2, UI bug fix

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Josephine Lee <[email protected]>

Azhng self-assigned this Feb 17, 2022

Azhng added A-sql-observability Related to observability of the SQL layer T-sql-observability labels Feb 19, 2022

Azhng mentioned this issue Feb 23, 2022

sql: contention event store main tracking issue #74485

Closed

20 tasks

tbg mentioned this issue Feb 24, 2022

txnidcache: disable cache by default #76973

Merged

tbg added a commit to tbg/cockroach that referenced this issue Feb 24, 2022

txnidcache: disable cache by default

28b72f0

See cockroachdb#76738. Release note: None

craig bot pushed a commit that referenced this issue Feb 24, 2022

Merge #76973

9dc9eb0

76973: txnidcache: disable cache by default r=erikgrinaker a=tbg See #76738. Release note: None Co-authored-by: Tobias Grieger <[email protected]>

maryliag pushed a commit to maryliag/cockroach that referenced this issue Feb 28, 2022

txnidcache: disable cache by default

822a25d

See cockroachdb#76738. Release note: None

ajwerner mentioned this issue Mar 1, 2022

sql/contention/txnidcache: reuse blocks in list, account for space #77220

Merged

RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022

txnidcache: disable cache by default

9d66119

See cockroachdb#76738. Release note: None

craig bot closed this as completed in 7dd272a Mar 7, 2022
