Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachperf: high core-count regression around February 16th #76738

Closed
Tracked by #74485
ajwerner opened this issue Feb 17, 2022 · 12 comments · Fixed by #77220
Closed
Tracked by #74485

roachperf: high core-count regression around February 16th #76738

ajwerner opened this issue Feb 17, 2022 · 12 comments · Fixed by #77220
Assignees
Labels
A-sql-observability Related to observability of the SQL layer branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Feb 17, 2022

Describe the problem

Please describe the issue you observed, and any steps we can take to reproduce it:
Screen Shot 2022-02-17 at 10 54 44 AM
https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=aws
Screen Shot 2022-02-17 at 10 55 27 AM
https://roachperf.crdb.dev/?filter=&view=kv95%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D32%2Fseq&tab=aws

The same day we see marked improvements in the lower core-count workloads. Presumably this is all due to #76350, but we should bisect and profile to understand.

cc @Azhng 😓

Jira issue: CRDB-13255

@ajwerner ajwerner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. branch-master Failures and bugs on the master branch. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Feb 17, 2022
@ajwerner
Copy link
Contributor Author

Perhaps this was my advice on how many shards to use.

@Azhng Azhng self-assigned this Feb 17, 2022
@Azhng
Copy link
Contributor

Azhng commented Feb 17, 2022

Hmmm is there a wiki on how to run the bisection ?

@Azhng
Copy link
Contributor

Azhng commented Feb 17, 2022

Hmm the new fifoCache is for sure faster than the cache.UnorderedCache. I have a feeling that this might be because we drop the batch size from 1024 to 168. I'll play around with this when I get time.

@tbg
Copy link
Member

tbg commented Feb 18, 2022

Can we treat this with some urgency? A 20% regression has ripple effects across the org, for example @nvanbenschoten was trying to validate a customer workload yesterday (on large machines) and this regression significantly shifts the baseline.

@Azhng
Copy link
Contributor

Azhng commented Feb 18, 2022

Hi @tbg, this issue is high on my todo list to work on. Meanwhile, you can disable the offending component via
SET CLUSTER SETTING sql.contention.txn_id_cache.max_size = 0 if you have urgent validation to do.

@Azhng
Copy link
Contributor

Azhng commented Feb 19, 2022

Spent the day investigating this. Interesting observation, I think in high core count machines, the writer is producing writes a lot faster than the single background goroutine can keep up. Once the channel is filled, and the goroutine cannot keep up, it produces back pressure, and slowing down the writers.

I did a few runs with different configuration, here is the result that I have observed:

(note: 16-128-168 here means 16 shards writer, channel of size 128, and batch size of 168)

configuration relative performance
disabled-via-cluster-setting 100.00%
enabled-cache-fifo-16-128-168 (master as of Feb 16th) 84.35%
enabled-cache-UnorderedCache-16-128-1024 (master before Feb 16th) 87.78%
tuned-64-512-2048 91.85%
tuned-64-512-2048-3-goroutines 102.39% (I assume +/- 2% is within margin of uncertainty?)

I haven't tested different # goroutine + buffer size configurations, so this might be a complete overkill, but it seems like it has eliminated the perf drop seen in 32-core machines.

@ajwerner thoughts on pursuing down this path to solve the perf issue here?

Raw benchmark data.

Branch elapsed errors ops(total) ops/sec(cum) avg(ms) p50(ms) p95(ms) p99(ms) pMax(ms)
disabled-baseline 2700.0s 0 200126759 74121.0 0.9 0.7 2.0 4.5 41.9
enabled-fifo-cache-16-128-168 2700.0s 0 168811009 62522.6 1.0 0.7 2.4 9.4 125.8
eanbled-unorderd-cache-16-128-1024 2700.0s 0 175668637 65062.4 1.0 0.7 2.5 5.0 151.0
64-512-2048 2700.0s 0 183813891 68079.2 0.9 0.7 2.4 6.6 79.7
64-512-2048-3-gourintes 2700.0s 0 204900223 75889.0 0.8 0.7 1.9 3.7 37.7

@Azhng Azhng added A-sql-observability Related to observability of the SQL layer T-sql-observability labels Feb 19, 2022
@erikgrinaker
Copy link
Contributor

Should we consider disabling this setting by default until we get to the bottom of it?

@erikgrinaker
Copy link
Contributor

@Azhng Thoughts on disabling this? It can mask other performance issues, and as we're moving into stability we'll need to start addressing them across the board.

tbg added a commit to tbg/cockroach that referenced this issue Feb 24, 2022
craig bot pushed a commit that referenced this issue Feb 24, 2022
76973: txnidcache: disable cache by default r=erikgrinaker a=tbg

See #76738.

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg
Copy link
Member

tbg commented Feb 24, 2022

The regression should now be "fixed" by defaulting the cluster setting to zero (#76973 (comment)). We'll need to add an annotation to roachperf.

maryliag pushed a commit to maryliag/cockroach that referenced this issue Feb 28, 2022
@Azhng
Copy link
Contributor

Azhng commented Mar 1, 2022

The heap profile here is pointing the finger at the eviction list in the FIFO store. Somehow that's creating a lot of objects on the heap. Hmm I was under the impression that using the sync.Pool was able to reduce those allocations?

Screen Shot 2022-02-28 at 23 15 44

This is with capacity limit set to 64MB

Hmm though this doesn't quite explain how running 3 copies of txnIDCache in 3 different goroutines was able to improve the situation. 🤔

Profile:

pprof.cockroach.alloc_objects.alloc_space.inuse_objects.inuse_space.004.pb.gz

@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 1, 2022

Those 208B blocks under fifoCache.add scare me. The big block is the map itself I suspect. The rest is the sync.Pool

@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 1, 2022

It seems to me like you're not accounting for the memory properly here. Namely, you need to account for the blocks themselves which are in use

ajwerner added a commit to ajwerner/cockroach that referenced this issue Mar 1, 2022
This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

This is DNM because the tests need to change.

Touches cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
Azhng pushed a commit to Azhng/cockroach that referenced this issue Mar 2, 2022
This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
Azhng pushed a commit to ajwerner/cockroach that referenced this issue Mar 2, 2022
This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
Azhng pushed a commit to ajwerner/cockroach that referenced this issue Mar 3, 2022
This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
Azhng pushed a commit to ajwerner/cockroach that referenced this issue Mar 3, 2022
This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
Azhng pushed a commit to ajwerner/cockroach that referenced this issue Mar 3, 2022
This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
craig bot pushed a commit that referenced this issue Mar 7, 2022
77208: sql: update test that was fooling itself r=ajwerner a=ajwerner

I have no clue what is going on in #76843 but this test was fooling itself
regarding the existence of separate connections.

Release justification: non-production code changes

Release note: None

77220: sql/contention/txnidcache: reuse blocks in list, account for space r=maryliag,ajwerner a=ajwerner

This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves #76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

77363: sql/delegate: avoid extra string->int parsing r=otan a=rafiss

Release justification: low risk improvement
Release note: None

77438: ui: Remove stray parenthesis in Jobs page r=jocrl a=jocrl

Addresses #77440.

This commit fixes the stray parenthesis at the end of the duration time for a
succeeded job. The parenthesis had been introduced in
#76691 and the 21.2 backport
#73624.

Before:
![image](https://user-images.githubusercontent.com/91907326/157065776-456c8f7d-1958-4192-b38d-dcb40432cf9d.png)

After:
![image](https://user-images.githubusercontent.com/91907326/157065785-e3f2db6a-67d1-4ae3-87cb-df71dccf0e5f.png)


Release note (ui): Remove stray parenthesis at the end of the duration time for
a succeeded job. It had been accidentally introduced to unreleased master and a
21.2 backport.

Release justification: Category 2, UI bug fix

Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Josephine Lee <[email protected]>
@craig craig bot closed this as completed in 7dd272a Mar 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-observability Related to observability of the SQL layer branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants