sstable: revisit writer block flushing heuristics #999

Closed
petermattis opened this issue Nov 17, 2020 · 6 comments · Fixed by #3508

Comments

@petermattis
Collaborator

sstable.shouldFlush contains the heuristics for deciding whether a block should be flushed or not during sstable construction. The intention is to flush a block before it reaches the configured blockSize. The block size controls how large the block will be in memory (on disk the block is compressed and may be significantly smaller). For CRDB, a block size of 32 KiB is used. What is interesting is how this block size interacts with jemalloc (the memory allocator linked into CRDB for C memory allocations). Jemalloc has many size classes and an allocation is always performed in the smallest size class that will hold it. Of interest to the block cache are the size classes 24 KiB, 28 KiB, 32 KiB, 40 KiB. If a block is just a tiny bit larger than 32 KiB it will be allocated from the 40 KiB size class which will waste ~25% of the space. Much better for a block to be just a bit smaller than 32 KiB.
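The size-class rounding described above can be sketched as follows. This is only an illustration: `exampleClasses`, `sizeClass`, and `wasted` are hypothetical names, and the class list contains just the four sizes mentioned in this comment rather than the allocator's full table.

```go
package main

import "fmt"

// exampleClasses holds only the four size classes named in the discussion,
// in bytes. A real allocator has many more; this subset is an assumption
// made for illustration.
var exampleClasses = []int{24 << 10, 28 << 10, 32 << 10, 40 << 10}

// sizeClass returns the smallest listed class that can hold n bytes, or n
// itself when n exceeds every listed class.
func sizeClass(n int, classes []int) int {
	for _, c := range classes {
		if n <= c {
			return c
		}
	}
	return n
}

// wasted returns the bytes lost to rounding n up to its size class.
func wasted(n int, classes []int) int {
	return sizeClass(n, classes) - n
}

func main() {
	// A block just under 32 KiB, one just over it, and one just over 28 KiB.
	for _, n := range []int{32<<10 - 64, 32<<10 + 64, 28<<10 + 64} {
		fmt.Printf("size=%d class=%d wasted=%d\n",
			n, sizeClass(n, exampleClasses), wasted(n, exampleClasses))
	}
}
```

A block 64 bytes over 32 KiB lands in the 40 KiB class and loses roughly 8 KiB, while a block 64 bytes under 32 KiB loses only 64 bytes.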

The shouldFlush heuristic attempts to flush a block just before it grows larger than the target block size. But there is a second heuristic that says "don't flush a block if it is smaller than 99% of the target block size". (Note: this heuristic was inherited from RocksDB.) 99% of 32 KiB is 31.68 KiB, a difference of only 328 bytes. So if we have a key/value pair that is larger than 328 bytes, we'll flush the block when it is just a little bit larger than 32 KiB. If we want to minimize internal fragmentation in the block cache, we should instead allow the block to be flushed earlier. If the block were flushed at just over 28 KiB, internal fragmentation would be about 14%.
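To make the interaction between the two heuristics concrete, here is a simplified sketch. It is not Pebble's actual shouldFlush: the function signatures, the `fillBlock` helper, and the fixed 1000-byte entry size are assumptions chosen for illustration.

```go
package main

import "fmt"

// shouldFlush is a simplified stand-in for the writer heuristic described
// above. It is not Pebble's implementation.
func shouldFlush(curSize, entrySize, blockSize, thresholdPct int) bool {
	if curSize >= blockSize {
		// The block has already reached the target size.
		return true
	}
	threshold := blockSize * thresholdPct / 100
	if curSize <= threshold {
		// Too small to be considered for flushing, even if the next
		// entry would push it past the target.
		return false
	}
	// Flush only if adding the next entry would exceed the target.
	return curSize+entrySize > blockSize
}

// fillBlock adds fixed-size entries until shouldFlush fires and returns the
// size at which the block is flushed.
func fillBlock(entrySize, blockSize, thresholdPct int) int {
	cur := 0
	for !shouldFlush(cur, entrySize, blockSize, thresholdPct) {
		cur += entrySize
	}
	return cur
}

func main() {
	const blockSize = 32 << 10
	fmt.Println("threshold 99%:", fillBlock(1000, blockSize, 99))
	fmt.Println("threshold 85%:", fillBlock(1000, blockSize, 85))
}
```

With 1000-byte entries and a 99% threshold, the block is still below the threshold at 32000 bytes, so one more entry is added and it is flushed at 33000 bytes, just over 32 KiB. With a lower threshold the same block is flushed at 32000 bytes.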

Making shouldFlush aware of the jemalloc size classes could significantly reduce this memory wastage. Does this matter in practice? Maybe. CRDB is frequently run with multi-gigabyte block cache sizes. The internal fragmentation is not accounted for in the block cache size, which makes memory usage higher than expected. Smarter block sizing heuristics could bring a tighter bound on memory usage, which we could use to reduce the CRDB memory footprint, or we could increase the block cache size in order to improve read performance. With a multi-gigabyte block cache we're talking about hundreds of megabytes of memory.

We'd want to make the allocator size class knowledge configurable so that we don't hard code something specific to jemalloc.

@petermattis
Collaborator Author

Also related to internal fragmentation is this TODO in internal/cache/value_normal.go:

func newValue(n int) *Value {
	if n == 0 {
		return nil
	}
	// When we're not performing leak detection, the lifetime of the returned
	// Value is exactly the lifetime of the backing buffer and we can manually
	// allocate both.
	//
	// TODO(peter): It may be better to separate the allocation of the value and
	// the buffer in order to reduce internal fragmentation in malloc. If the
	// buffer is right at a power of 2, adding valueSize might push the
	// allocation over into the next larger size.
	b := manual.New(valueSize + n)
	v := (*Value)(unsafe.Pointer(&b[0]))
	v.buf = b[valueSize:]
	v.ref.init(1)
	return v
}

@petermattis
Collaborator Author

I did a bit of analysis on an imported TPCC-100 dataset. The following table shows the uncompressed data block sizes bucketed by the jemalloc class sizes. count is the count of the number of blocks that fall in that class size. wasted is the average bytes wasted per block due to the actual block size being smaller than the class size. total space = class size * count. wasted space = wasted * count.

| class size | count | wasted | total space | wasted space |
|---|---|---|---|---|
| 28 KB | 49 | 2121.0 | 1.3 MB | 0.1 MB |
| 32 KB | 294530 | 145.3 | 9204.1 MB | 40.8 MB |
| 40 KB | 0 | 0 | 0 | 0 |

This looks great. The wasted space is quite small. But if we include the size of the Value struct which is allocated contiguously with the memory for the block, a different picture emerges:

| class size | count | wasted | total space | wasted space |
|---|---|---|---|---|
| 28 KB | 49 | 2089.0 | 1.3 MB | 0.1 MB |
| 32 KB | 221486 | 154.5 | 6921.4 MB | 32.6 MB |
| 40 KB | 73044 | 8180.6 | 2853.3 MB | 569.9 MB |

Wasted space is >10x higher in this scenario. It looks fairly straightforward to reclaim this wasted space.
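The bucketing behind this kind of analysis can be sketched roughly as below. This is not the tool used for the numbers above: `summarize`, `bucket`, and the sample block sizes are made up, and the 32-byte header is an assumption (it is consistent with the 28 KB rows, whose average waste differs by 32 bytes between the two tables).

```go
package main

import "fmt"

// classes lists the three size classes from the tables above, in bytes.
var classes = []int{28 << 10, 32 << 10, 40 << 10}

// bucket accumulates per-class totals.
type bucket struct {
	count  int
	wasted int // total bytes wasted across all blocks in the class
}

// summarize buckets each block by the size class of (block size + header)
// and records how many bytes each allocation wastes.
func summarize(blockSizes []int, header int) map[int]bucket {
	out := make(map[int]bucket)
	for _, s := range blockSizes {
		n := s + header
		class := n // fall back to n when it exceeds every listed class
		for _, c := range classes {
			if n <= c {
				class = c
				break
			}
		}
		b := out[class]
		b.count++
		b.wasted += class - n
		out[class] = b
	}
	return out
}

func main() {
	// Made-up uncompressed block sizes clustered just under 32 KiB.
	sizes := []int{32600, 32750, 32768, 28000}
	for _, header := range []int{0, 32} {
		fmt.Println("header bytes:", header)
		s := summarize(sizes, header)
		for _, c := range classes {
			fmt.Printf("  class=%d count=%d wasted=%d\n", c, s[c].count, s[c].wasted)
		}
	}
}
```

In this toy input, adding the header moves two of the three blocks near 32 KiB into the 40 KiB class, which is the same shift the second table shows at scale.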

@github-actions

github-actions bot commented Jun 6, 2022

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it
in 10 days to keep the issue queue tidy. Thank you for your
contribution to Pebble!

@sumeerbhola
Collaborator

sumeerbhola commented Nov 27, 2023

Internal fragmentation (the topic of this issue) is not tracked in jemalloc stats, but we can see it when varying the size of the value in a kv50 workload. There are three runs below: (1) uses 4096-byte values that are not compressible, (2) uses 4096-byte values with --target-compression-ratio=3, and (3) uses 1024-byte values with --target-compression-ratio=3. Note that rocksdb.block.cache.usage stabilizes at the same value in all three runs, but the allocated bytes in runs 1 and 2 are much higher. Looking at the detailed jemalloc stats, most of the allocated bytes are in size class 40960 in runs 1 and 2, while in run 3 most of the allocated bytes are in size class 32768. The difference between allocbytes and totalbytes (external fragmentation) is similar in all three runs.

Screenshot 2023-11-27 at 5 37 38 PM

@petermattis
Collaborator Author

Does the compressibility of the data actually matter here given that the block cache stores uncompressed blocks?

My recollection of the TPC-C analysis I did above was that I dumped out the block sizes for all of the sstables using the pebble sstable tool (possibly with some custom tweaks, I can't recall). Looking at the Pebble shouldFlush code, I have a suspicion that this code may be problematic:

	// The block is currently smaller than the target size.
	if estimatedBlockSize <= sizeThreshold {
		// The block is smaller than the threshold size at which we'll consider
		// flushing it.
		return false
	}

In CRDB, the block size is 32 KiB and the configured size threshold is left at the default (90%). So we won't consider a block for flushing if its estimated size is smaller than 29492 bytes. With 4096-byte values we're guaranteed to have blocks slightly larger than 32 KiB. I suspect we can do something better here. If we knew the jemalloc size classes, we could make a better decision about whether flushing the block now, or adding another entry and then flushing, reduces internal fragmentation more. It is somewhat awkward to have Pebble know about the jemalloc size classes since using jemalloc isn't required. Perhaps this could be a config option that CRDB specifies.
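One possible shape for such a decision is a one-step lookahead that compares the wasted fraction of the allocation with and without the next entry. The sketch below is only an illustration of that idea, not a proposal for the actual implementation: `flushNow`, the `overhead` parameter (standing in for any per-block metadata allocated alongside the block), and the class list are all assumptions.

```go
package main

import "fmt"

// sizeClass returns the smallest listed class that can hold n bytes, or n
// itself when n exceeds every listed class.
func sizeClass(n int, classes []int) int {
	for _, c := range classes {
		if n <= c {
			return c
		}
	}
	return n
}

// flushNow reports whether flushing the block at its current size wastes no
// larger a fraction of its allocation than adding one more entry first.
// overhead is an assumed number of metadata bytes allocated with the block.
func flushNow(curSize, entrySize, overhead int, classes []int) bool {
	wasteFrac := func(n int) float64 {
		alloc := n + overhead
		c := sizeClass(alloc, classes)
		return float64(c-alloc) / float64(c)
	}
	return wasteFrac(curSize) <= wasteFrac(curSize+entrySize)
}

func main() {
	// Size classes would come from configuration; these four are the ones
	// mentioned earlier in the thread.
	classes := []int{24 << 10, 28 << 10, 32 << 10, 40 << 10}

	// Seven entries of roughly 4 KiB each: one more would spill into the
	// 40 KiB class, so flushing now wastes less.
	fmt.Println(flushNow(28840, 4120, 32, classes))

	// Here adding another entry fills the 32 KiB class more tightly.
	fmt.Println(flushNow(25000, 4000, 32, classes))
}
```

In the first case the block is 28840 bytes, below the 29492-byte threshold quoted above, so a threshold-only check would not consider flushing it even though the next entry pushes the allocation past 32 KiB.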

@sumeerbhola
Collaborator

> Does the compressibility of the data actually matter here given that the block cache stores uncompressed blocks?

No, it doesn't -- I was just playing around.

CheranMahalingam added a commit to CheranMahalingam/pebble that referenced this issue Apr 12, 2024
Previously, the sstable writer contained heuristics to flush sstable
blocks when the size reached a certain threshold. In CRDB this is
defined as 32KiB. However, when these blocks are loaded into memory
additional metadata is allocated with the block causing the allocation
to go beyond this threshold. Since CRDB uses jemalloc, these allocations
use a 40KiB size class which leads to internal fragmentation and higher
memory usage. This commit decrements the block size threshold to reduce
internal memory fragmentation.

Informs: cockroachdb#999.
CheranMahalingam added a commit to CheranMahalingam/pebble that referenced this issue Apr 19, 2024
Currently, the sstable writer contains heuristics to flush sstable
blocks once the size reaches a specified threshold. In CRDB this is
defined as 32KiB. However, when these blocks are loaded into memory
additional metadata is allocated sometimes exceeding the 32KiB threshold.
Since CRDB uses jemalloc, these allocations use a 40KiB size class which
leads to significant internal fragmentation. In addition, since the
system is unaware of these size classes we cannot design heuristics that
prioritize reducing memory fragmentation. Reducing internal
fragmentation can help reduce CRDB's memory footprint. This commit
decrements the target block size to prevent internal fragmentation for
small key-value pairs and adds support for optionally specifying size
classes to enable a new set of heuristics that will reduce internal
fragmentation for workloads with larger key-value pairs.

Fixes: cockroachdb#999.
CheranMahalingam added a commit that referenced this issue Apr 26, 2024
Currently, the sstable writer contains heuristics to flush sstable
blocks once the size reaches a specified threshold. In CRDB this is
defined as 32KiB. However, when these blocks are loaded into memory
additional metadata is allocated sometimes exceeding the 32KiB threshold.
Since CRDB uses jemalloc, these allocations use a 40KiB size class which
leads to significant internal fragmentation. In addition, since the
system is unaware of these size classes we cannot design heuristics that
prioritize reducing memory fragmentation. Reducing internal
fragmentation can help reduce CRDB's memory footprint. This commit
decrements the target block size to prevent internal fragmentation for
small key-value pairs and adds support for optionally specifying size
classes to enable a new set of heuristics that will reduce internal
fragmentation for workloads with larger key-value pairs.

Fixes: #999.
@jbowens jbowens moved this to Done in [Deprecated] Storage Jun 4, 2024