Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/contention/txnidcache: reuse blocks in list, account for space #77220

Merged
merged 1 commit into from
Mar 7, 2022

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Mar 1, 2022

This change does two things to the txnidcache:

  1. It accounts for the space used by the fifo eviction list. Previously
    we'd use more than double the intended space. We should probably also
    subtrace out the size of the buffers we're currently filling and the
    channel we use to communicate them, but I'll leave that for later.
  2. It stops trying to compact the blocks. Compacting the blocks ends up
    being a good deal of overhead because we have to copy across every
    single message. Instead we can just append the block directly to the
    list. This does have the hazard of wasting a lot of space when the
    blocks are sparse. However, if the blocks are sparse, we know that the
    throughput is low, so it's fine.

Resolves #76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None

@ajwerner ajwerner requested a review from Azhng March 1, 2022 14:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 1, 2022

@Azhng what do you think of this? I didn't update the tests. It seemed sort of painful. It might be a similar amount of effort to port that failing test to be datadriven. It'd be an improvement.

Copy link
Contributor

@Azhng Azhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I will run this through kv95 to see what's the performance impact. If it looks good I will go ahead and port this to datadriven test.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@Azhng Azhng force-pushed the ajwerner/txnidcache-cheaper branch from 8ac79c0 to edf6f89 Compare March 2, 2022 22:17
@Azhng Azhng changed the title [DNM] sql/contention/txnidcache: reuse blocks in list, account for space sql/contention/txnidcache: reuse blocks in list, account for space Mar 2, 2022
@Azhng
Copy link
Contributor

Azhng commented Mar 2, 2022

kv95 (3 node with n1-highcpu-32 nodes):

Workload configuration:

bin/roachprod run $CLUSTER:4 -- ./cockroach workload init kv --splits 1000 --read-percent 95 {pgurl:1-3}
bin/roachprod run $CLUSTER:4 -- ./cockroach workload run kv --read-percent 95 --concurrency 192 --duration 30m {pgurl:1-3} 

TxnID Cache disabled:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0      217076813       120598.2      1.6      1.2      4.7     10.0    109.1  

TxnID Cache enabled:

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 1800.0s        0      231054734       128363.7      1.5      1.2      3.8      9.4    121.6  

@Azhng Azhng requested a review from a team March 2, 2022 22:23
@Azhng
Copy link
Contributor

Azhng commented Mar 2, 2022

Added datadriven test.

Comment on lines 57 to 67
inspectEvictionListShape
----
blockIdx: 0, blockSize: 42
blockIdx: 1, blockSize: 84
blockIdx: 2, blockSize: 168

ensureEvictionBlockOrdering names=block1,block2,block3
----
blockIdx: 0, blockName: block1
blockIdx: 1, blockName: block2
blockIdx: 2, blockName: block3
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you're missing out on the power of datadriven tests. The point is that you visually verify your expectations when you read this file and you can rewrite it.

It seems like all of verifyBlockExistence, inspectEvictionListShape and ensureEvictionBlockOrdering all should just be something like show

show
----
block1  42
block2  84
block3  168

That tells you the order and the size and the indexes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also removes the need for verifyBlockNonExistence

Copy link
Contributor

@Azhng Azhng Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the test so that show will output the eviction list and a dump of the in-memory map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do you think the output of the in-memory map is a bit too verbose ?

@Azhng Azhng force-pushed the ajwerner/txnidcache-cheaper branch from edf6f89 to e2b5e1e Compare March 2, 2022 23:35
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)


pkg/sql/contention/txnidcache/testdata/fifo_cache, line 67 at r2 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Hmm, do you think the output of the in-memory map is a bit too verbose ?

way too verbose, it was not my suggestion. My suggestion was just show the block names and their number of elements.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)


pkg/sql/contention/txnidcache/testdata/fifo_cache, line 67 at r2 (raw file):

Previously, ajwerner wrote…

way too verbose, it was not my suggestion. My suggestion was just show the block names and their number of elements.

Even the indexes are superfluous.

Copy link
Contributor

@Azhng Azhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @Azhng)


pkg/sql/contention/txnidcache/testdata/fifo_cache, line 67 at r2 (raw file):

Previously, ajwerner wrote…

Even the indexes are superfluous.

Hmm I see. Just to clarify, how should this test go about testing the content within the in-memory map? Should the check be performed while show is populating the output?

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng)


pkg/sql/contention/txnidcache/testdata/fifo_cache, line 67 at r2 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Hmm I see. Just to clarify, how should this test go about testing the content within the in-memory map? Should the check be performed while show is populating the output?

For the contents you can have a different command which probes? Were you testing the contents before?

@Azhng Azhng force-pushed the ajwerner/txnidcache-cheaper branch from e2b5e1e to fae5171 Compare March 3, 2022 18:26
Copy link
Contributor

@Azhng Azhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @Azhng)


pkg/sql/contention/txnidcache/testdata/fifo_cache, line 67 at r2 (raw file):

Previously, ajwerner wrote…

For the contents you can have a different command which probes? Were you testing the contents before?

Ah, I see the where the confusion is coming from 🤦 . verifyBlockExistence was the one that's doing the check on the in-memroy map content. Though I should definitely rename that. 👍

@Azhng Azhng force-pushed the ajwerner/txnidcache-cheaper branch 2 times, most recently from 6ec05fe to 2a3830b Compare March 3, 2022 18:48
Copy link
Contributor

@Azhng Azhng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @Azhng)


pkg/sql/contention/txnidcache/testdata/fifo_cache, line 67 at r2 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Ah, I see the where the confusion is coming from 🤦 . verifyBlockExistence was the one that's doing the check on the in-memroy map content. Though I should definitely rename that. 👍

Updated.

This change does two things to the txnidcache:
1) It accounts for the space used by the fifo eviction list. Previously
   we'd use more than double the intended space. We should probably also
   subtrace out the size of the buffers we're currently filling and the
   channel we use to communicate them, but I'll leave that for later.
2) It stops trying to compact the blocks. Compacting the blocks ends up
   being a good deal of overhead because we have to copy across every
   single message. Instead we can just append the block directly to the
   list. This does have the hazard of wasting a lot of space when the
   blocks are sparse. However, if the blocks are sparse, we know that the
   throughput is low, so it's fine.

Resolves cockroachdb#76738

Release justification: bug fixes and low-risk updates to new functionality

Release note: None
@Azhng Azhng force-pushed the ajwerner/txnidcache-cheaper branch from 2a3830b to 7dd272a Compare March 4, 2022 00:58
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @Azhng)

Copy link
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 3 files at r1, 1 of 2 files at r5, 1 of 1 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)

@Azhng
Copy link
Contributor

Azhng commented Mar 7, 2022

TFTR!

bors r=maryliag,ajwerner

@craig
Copy link
Contributor

craig bot commented Mar 7, 2022

Build succeeded:

@craig craig bot merged commit b5bbbed into cockroachdb:master Mar 7, 2022
Azhng added a commit to Azhng/cockroach that referenced this pull request Mar 10, 2022
Previously, TxnID cache was disabled by default due to performance
issues. This has since been resolved in cockroachdb#77220.
This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None
Azhng added a commit to Azhng/cockroach that referenced this pull request Mar 14, 2022
Previously, TxnID cache was disabled by default due to performance
issues. This has since been resolved in cockroachdb#77220.
This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None
craig bot pushed a commit that referenced this pull request Mar 14, 2022
77442: sqlproxyccl: Add codeUnavailable to the list of error codes r=alyshanjahani-crl a=alyshanjahani-crl

Release justification: fixes for high-priority bug in existing functionality

Previously, if a tenant cluster had maxPods set to 0 the error returned by
directory.EnsureTenantAddr was not treated as a non-retryable error.

The tenant directory implementation used in CockroachCloud now identifies
this situation and returns a status error with codes.FailedPrecondition and
a descriptive message.

This patch adds a check for the FailedPrecondition error in
connector.lookupAddr.

Release note: None

77626: sql: re-enable txnidcache by default r=maryliag,ajwerner a=Azhng

Previously, TxnID cache was disabled by default due to performance
issues. This has since been resolved in #77220.
This commit re-enables TxnID Cache by default.

Release justification: low risk high reward change.
Release note: None

77675: cmd/roachtest: tweaks to sstable corruption test r=itsbilal a=nicktrav

A few minor changes to the existing roachtest:

Reduce the warehouse count on ingest back to 100 - this should be ample,
and it matches the warehouse count on the read path in a later part of
the test.

Change the stop options to wait until the process has been terminated
before continuing.

Re-work the command to search for SSTables to corrupt - use tables from
the most recent manifest (if multiple are present); consider SSTables
with either a start or end key representing a user table; shuffle tables
to distribute corruption over the LSM; perform filtering in the bash
command rather than in the test runner itself.

Release justification: Roachtest-only change.

Release note: None.

Touches #77321.

77700: ccl/sqlproxyccl: add connection migration-related metrics r=JeffSwenson a=jaylim-crl

#### ccl/sqlproxyccl: add metric that measures connection migration latency

Previously, we added support for connection migration in #76805. This commit
adds a new `proxy.conn_migration.attempted.latency` metric that tracks latency
for attempted connection migrations. Having this metric would be beneficial
as we can now know how long users were blocked during connection migrations.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

#### ccl/sqlproxyccl: add metric that records the transfer response message size

Informs #76000, and follow up to #76805.

This commit adds a new proxy.conn_migration.transfer_response.message_size
metric that will track the distribution of the transfer response message size.
This will be used to tune a value for the SQL cluster setting:
sql.session_transfer.max_session_size.

Release justification: Low-risk metric-only change within sqlproxy.

Release note: None

Co-authored-by: Alyshan Jahani <[email protected]>
Co-authored-by: Azhng <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Jay <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachperf: high core-count regression around February 16th
4 participants