
Add optimized data structure for tracking PollStates #48

Conversation

@michaelwoerister (Collaborator) commented Nov 9, 2022

Closes #42. Not very polished yet but it should get the job done.

@yoshuawuyts (Owner) left a comment

Miri seems to be hanging on CI. I'm not sure what's causing that, to be honest, but we should probably resolve it before merging.

@michaelwoerister (Collaborator, Author)

Miri is probably just timing out because the test is doing too much work. Let's see if reducing the number of iterations under Miri solves the problem.
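
For reference, one common way to do that (a hedged sketch with hypothetical names, not the PR's actual test) is to gate the iteration count on the `miri` cfg, which Miri sets automatically:

// Hypothetical test sketch: run far fewer iterations under Miri.
const ITERATIONS: usize = if cfg!(miri) { 10 } else { 10_000 };

#[test]
fn stress_poll_states() {
    for _ in 0..ITERATIONS {
        // exercise the PollStates data structure here
    }
}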

@yoshuawuyts mentioned this pull request Nov 10, 2022

@yoshuawuyts (Owner) left a comment

The code looks good, and I'm happy to merge this as is. Performance-wise, though, there's a bit of a question about what's going on. I've tested this against the suite from #52.

Benchmarks

10 items

We're seeing roughly similar performance here, but with less deviation, which seems promising:

[Criterion benchmark plot: vec::join 10]

100 items

This looks like a slight performance slowdown; the same goes for the 1000-item case.

[Criterion benchmark plot: vec::join 100]

@michaelwoerister (Collaborator, Author) commented Nov 10, 2022

Slower? We can't have that! 😉

Is there a convenient way to compare two git branches with criterion?

@michaelwoerister (Collaborator, Author)

This version should do a bit better, but it still seems to be slower. I'm not sure if this optimization is worth the trouble.

@yoshuawuyts (Owner) commented Nov 11, 2022

Yeah, the benchmarks I'm now getting are:

❯ critcmp main patch -f vec::join
group             main                                   patch
-----             ----                                   -----
vec::join 10      1.00    646.7±5.73ns        ? ?/sec    1.01   654.0±34.59ns        ? ?/sec
vec::join 100     1.00      3.3±0.17µs        ? ?/sec    1.18      3.9±0.12µs        ? ?/sec
vec::join 1000    1.00     27.3±0.21µs        ? ?/sec    1.20     32.8±1.55µs        ? ?/sec

Diagnosis

But I think I may have found something. The main idea here is that if the entries fit within the space of a pointer-sized word, we can store them inline in that space instead of heap-allocating. However, if we actually check the sizes, it turns out we're not quite achieving that right now:

#[test]
fn check_bit_sizes() {
    const BYTE: usize = 8;
    assert_eq!(std::mem::size_of::<EntriesWord>() * BYTE, 64);      // ✅ 64 bits
    assert_eq!(std::mem::size_of::<PollStateEntries>() * BYTE, 64); // ❌ 192 bits
}

It seems the PollStateEntries enum is not one word (64 bits), but three words (192 bits). I suspect that may be messing with cache behavior, and be the cause of the performance regression in the vec case.

Potential Directions

/// The max number of entries `PollStates` can store without heap allocations.
const MAX_INLINE_ENTRY_COUNT: usize = ENTRIES_PER_WORD;
const ENTRIES_PER_WORD: usize = (std::mem::size_of::<EntriesWord>() * 8) / 2; // two bits per entry

enum PollStateEntries {
    Inline(EntriesWord),
    Boxed(Vec<EntriesWord>),
}

A few things stand out to me here:

  • EntriesWord is a u64, but Vec<EntriesWord> is a usize. These may be the same on 64-bit machines, but they aren't necessarily.
  • ENTRIES_PER_WORD will compile down to 32 on 64-bit machines. If you put the inline word in an enum, however, you need extra space for the enum's discriminant (the variant label).

What I'm wondering is whether, by shuffling these around, we can get the size of PollStateEntries down to a single word. Maybe some combination of using usize everywhere and leaving room for the discriminant alongside the entries. Or switching ENTRIES_PER_WORD to NonZeroUsize might help, since non-zero types let the compiler stash the discriminant in the unused zero value (see the sketch below). I'm not sure, but it might be worth playing around with?
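
For context on that last idea, here's a minimal sketch (not the PR's code; the sizes assume a 64-bit target) of the niche optimization: the compiler can hide an enum's discriminant inside a value the payload can never take, such as zero for NonZeroUsize or a null pointer for Vec.

use std::mem::size_of;
use std::num::NonZeroUsize;

fn main() {
    // Option<usize> needs a separate tag word, so it doubles in size.
    assert_eq!(size_of::<Option<usize>>(), 16);
    // NonZeroUsize can never be 0, so the tag hides in that forbidden value.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), 8);
    // Vec's pointer is non-null, so wrapping a Vec in an enum can be "free" too.
    assert_eq!(size_of::<Option<Vec<u64>>>(), size_of::<Vec<u64>>());
}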

@yoshuawuyts (Owner) commented Nov 11, 2022

Oh actually, we're a bit off here: the assumption that "a pointer to a Vec is 64 bits" is wrong. A Vec is a wide pointer, containing the actual pointer to the heap plus the length and capacity. That means it's three words wide:

assert_eq!(std::mem::size_of::<Vec<()>>() * BYTE, 192);  // ✅ 192 bits

The issue isn't that we're not aligning our usizes and whatnot correctly; there's plenty of space for that. In fact, if we do it right we should be able to store up to three words' worth of entries in the space of a single Vec. So I'm not really sure what else might be causing it. It really might just be some other alignment thing?

@yoshuawuyts (Owner)

Okay, I did find another interesting bit! I was looking at the size of PollStateEntries, but I should have been looking at the size of PollStates itself. It's defined as:

pub(crate) struct PollStates {
    len: usize,
    entries: PollStateEntries,
}

What we'd want is for PollStates to be the same size as Vec<PollState>. Instead, it's actually larger!

assert_eq!(std::mem::size_of::<Vec<PollState>>() * BYTE, 192); // ✅ 192 bits
assert_eq!(std::mem::size_of::<PollStates>() * BYTE, 192);     // ❌ 256 bits

And if you think about it, this makes sense. In the enum we're making space for Vec<EntriesWord>, which already keeps a pointer plus a length (and capacity). But we then wrap that in another struct which also keeps track of a length, so we're tracking the length twice. Add on top of that the inline EntriesWord and the enum's discriminant, and we end up larger than the original Vec<PollState>.

PollStates::len should only be needed for the PollStateEntries::Inline case. I wonder whether moving it into that enum arm would reduce the total size of the struct, and in turn make the Boxed case faster, since it would no longer need to track the length in two places at the same time.

There might be other optimizations possible too, with respect to accounting for the discriminant and whatnot. But that seems like the right place to start, perhaps?

@yoshuawuyts (Owner) commented Nov 11, 2022

And confirmation that inlining the length in the Inline branch actually normalizes the sizes (playground):

enum PollStateEntries {
    Inline(usize, EntriesWord), // now stores an extra usize
    Boxed(Vec<EntriesWord>),
}

pub(crate) struct PollStates { // no longer stores a usize
    entries: PollStateEntries,
}

#[test]
fn check_bit_sizes() {
    const BYTE: usize = 8;
    assert_eq!(std::mem::size_of::<Vec<EntriesWord>>() * BYTE, 192); // ✅ 192 bits (3 words)
    assert_eq!(std::mem::size_of::<PollStateEntries>() * BYTE, 192); // ✅ 192 bits (3 words)
    assert_eq!(std::mem::size_of::<PollStates>() * BYTE, 192);       // ✅ 192 bits (3 words)
}

@michaelwoerister (Collaborator, Author)

I'm starting to think that storing PollStates as bit masks doesn't quite pull its weight, given the complexity (and it's not actually faster, at least in the microbenchmarks). I'll post a much simpler version when I find the time.

@michaelwoerister (Collaborator, Author)

The length of Vec<EntriesWord> is not the same as the number of entries, since the last word might not be fully used.
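
A small sketch of what that means in practice (hypothetical helper, not the PR's code), given that each u64 word packs ENTRIES_PER_WORD two-bit states:

type EntriesWord = u64;

// Two bits per entry, so one 64-bit word holds 32 entries.
const ENTRIES_PER_WORD: usize = (std::mem::size_of::<EntriesWord>() * 8) / 2;

// Number of words needed to hold `num_entries` packed entries, rounded up.
fn words_needed(num_entries: usize) -> usize {
    (num_entries + ENTRIES_PER_WORD - 1) / ENTRIES_PER_WORD
}

fn main() {
    assert_eq!(words_needed(32), 1);
    // 33 entries need a second, mostly empty word: Vec::len() would be 2,
    // while the entry count is 33.
    assert_eq!(words_needed(33), 2);
}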

@michaelwoerister (Collaborator, Author)

Closing in favor of the much simpler #78.

Linked issue (may be closed by this pull request): Inline PollState in future when possible