
Add optimized data structure for tracking PollStates #48

Conversation

@michaelwoerister (Collaborator) commented Nov 9, 2022

Closes #42. Not very polished yet but it should get the job done.

@yoshuawuyts (Owner) left a comment

Miri seems to be hanging on CI. I'm not sure what's causing that, to be honest, but we should probably resolve it before merging.

@michaelwoerister (Collaborator, Author)

Miri is probably just timing out because the test is doing too much work. Let's see if reducing the number of iterations under Miri solves the problem.
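
For reference, one common way to do that (a hedged sketch with hypothetical names, not the PR's actual test) is to gate the iteration count on the `miri` cfg, which Miri sets automatically:

// Hypothetical test sketch: run far fewer iterations under Miri.
const ITERATIONS: usize = if cfg!(miri) { 10 } else { 10_000 };

#[test]
fn stress_poll_states() {
    for _ in 0..ITERATIONS {
        // exercise the PollStates data structure here
    }
}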

@yoshuawuyts mentioned this pull request Nov 10, 2022

@yoshuawuyts (Owner) left a comment

The code looks good, and I'm happy to merge this as is. Performance-wise, though, there's a bit of a question about what's going on. I've tested this against the suite from #52.

Benchmarks

10 items

We're seeing roughly similar performance here, but with less deviation, which seems promising:

[Criterion benchmark plot: vec::join 10]

100 items

This looks like a slight performance slowdown; the same goes for the 1000-item case.

[Criterion benchmark plot: vec::join 100]

@michaelwoerister (Collaborator, Author) commented Nov 10, 2022

Slower? We can't have that! 😉

Is there a convenient way to compare two git branches with criterion?

@michaelwoerister (Collaborator, Author)

This version should do a bit better, but it still seems to be slower. I'm not sure if this optimization is worth the trouble.

@yoshuawuyts (Owner) commented Nov 11, 2022

Yeah, the benchmarks I'm now getting are:

❯ critcmp main patch -f vec::join
group             main                                   patch
-----             ----                                   -----
vec::join 10      1.00    646.7±5.73ns        ? ?/sec    1.01   654.0±34.59ns        ? ?/sec
vec::join 100     1.00      3.3±0.17µs        ? ?/sec    1.18      3.9±0.12µs        ? ?/sec
vec::join 1000    1.00     27.3±0.21µs        ? ?/sec    1.20     32.8±1.55µs        ? ?/sec

Diagnosis

But I think I may have found something. The main idea here is that if the entries fit within the space of a pointer-sized word, we can store them inline in that space instead of heap-allocating. However, if we actually check the sizes, it turns out we're not quite achieving that right now:

#[test]
fn check_bit_sizes() {
    const BYTE: usize = 8;
    assert_eq!(std::mem::size_of::<EntriesWord>() * BYTE, 64);      // ✅ 64 bits
    assert_eq!(std::mem::size_of::<PollStateEntries>() * BYTE, 64); // ❌ 192 bits
}

It seems the PollStateEntries enum is not one word (64 bits), but three words (192 bits). I suspect that may be messing with cache behavior, and be the cause of the performance regression in the vec case.

Potential Directions

/// The max number of entries `PollStates` can store without heap allocations.
const MAX_INLINE_ENTRY_COUNT: usize = ENTRIES_PER_WORD;
const ENTRIES_PER_WORD: usize = (std::mem::size_of::<EntriesWord>() * 8) / 2; // two bits per entry

enum PollStateEntries {
    Inline(EntriesWord),
    Boxed(Vec<EntriesWord>),
}

A few things stand out to me here:

  • EntriesWord is a u64, but Vec<EntriesWord> is a usize. These may be the same on 64-bit machines, but they aren't necessarily.
  • ENTRIES_PER_WORD will compile down to 32 on 64-bit machines. If you put the inline word in an enum, however, you need extra space for the enum's discriminant (the variant label).

What I'm wondering is whether, by shuffling these around, we can get the size of PollStateEntries down to a single word. Maybe some combination of using usize everywhere and leaving room for the discriminant alongside the entries. Or switching ENTRIES_PER_WORD to NonZeroUsize might help, since non-zero types let the compiler stash the discriminant in the unused zero value (see the sketch below). I'm not sure, but it might be worth playing around with?
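
For context on that last idea, here's a minimal sketch (not the PR's code; the sizes assume a 64-bit target) of the niche optimization: the compiler can hide an enum's discriminant inside a value the payload can never take, such as zero for NonZeroUsize or a null pointer for Vec.

use std::mem::size_of;
use std::num::NonZeroUsize;

fn main() {
    // Option<usize> needs a separate tag word, so it doubles in size.
    assert_eq!(size_of::<Option<usize>>(), 16);
    // NonZeroUsize can never be 0, so the tag hides in that forbidden value.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), 8);
    // Vec's pointer is non-null, so wrapping a Vec in an enum can be "free" too.
    assert_eq!(size_of::<Option<Vec<u64>>>(), size_of::<Vec<u64>>());
}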

@yoshuawuyts (Owner) commented Nov 11, 2022

Oh actually, we're a bit off here: the assumption that "a pointer to a Vec is 64 bits" is wrong. A Vec is a wide pointer, containing the actual pointer to the heap plus the length and capacity. That means it's three words wide:

assert_eq!(std::mem::size_of::<Vec<()>>() * BYTE, 192);  // ✅ 192 bits

The issue isn't that we're not aligning our usizes and whatnot correctly; there's plenty of space for that. In fact, if we do it right we should be able to store up to three words' worth of entries in the space of a single Vec. So I'm not really sure what else might be causing it. It really might just be some other alignment thing?

@yoshuawuyts (Owner)

Okay, I did find another interesting bit! I was looking at the size of PollStateEntries, but I should have been looking at the size of PollStates itself. It's defined as:

pub(crate) struct PollStates {
    len: usize,
    entries: PollStateEntries,
}

What we'd want is for PollStates to be the same size as Vec<PollState>. Instead, it's actually larger!

assert_eq!(std::mem::size_of::<Vec<PollState>>() * BYTE, 192); // ✅ 192 bits
assert_eq!(std::mem::size_of::<PollStates>() * BYTE, 192);     // ❌ 256 bits

And if you think about it, this makes sense. In the enum we're making space for Vec<EntriesWord>, which already keeps a pointer plus a length (and capacity). But we then wrap that in another struct which also keeps track of a length, so we're tracking the length twice. Add on top of that the inline EntriesWord and the enum's discriminant, and we end up larger than the original Vec<PollState>.

PollStates::len should only be needed for the PollStateEntries::Inline case. I wonder whether moving it into that enum arm would reduce the total size of the struct, and in turn make the Boxed case faster, since it would no longer need to track the length in two places at the same time.

There might be other optimizations possible too, with respect to accounting for the discriminant and whatnot. But that seems like the right place to start, perhaps?

@yoshuawuyts (Owner) commented Nov 11, 2022

And confirmation that inlining the length in the Inline branch actually normalizes the sizes (playground):

enum PollStateEntries {
    Inline(usize, EntriesWord), // now stores an extra usize
    Boxed(Vec<EntriesWord>),
}

pub(crate) struct PollStates { // no longer stores a usize
    entries: PollStateEntries,
}

#[test]
fn check_bit_sizes() {
    const BYTE: usize = 8;
    assert_eq!(std::mem::size_of::<Vec<EntriesWord>>() * BYTE, 192); // ✅ 192 bits (3 words)
    assert_eq!(std::mem::size_of::<PollStateEntries>() * BYTE, 192); // ✅ 192 bits (3 words)
    assert_eq!(std::mem::size_of::<PollStates>() * BYTE, 192);       // ✅ 192 bits (3 words)
}

@michaelwoerister (Collaborator, Author)

I'm starting to think that storing PollStates as bit masks doesn't quite pull its weight, given the complexity (and it's not actually faster, at least in the microbenchmarks). I'll post a much simpler version when I find the time.

@michaelwoerister (Collaborator, Author)

The length of Vec<EntriesWord> is not the same as the number of entries, since the last word might not be fully used.
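
A small sketch of what that means in practice (hypothetical helper, not the PR's code), given that each u64 word packs ENTRIES_PER_WORD two-bit states:

type EntriesWord = u64;

// Two bits per entry, so one 64-bit word holds 32 entries.
const ENTRIES_PER_WORD: usize = (std::mem::size_of::<EntriesWord>() * 8) / 2;

// Number of words needed to hold `num_entries` packed entries, rounded up.
fn words_needed(num_entries: usize) -> usize {
    (num_entries + ENTRIES_PER_WORD - 1) / ENTRIES_PER_WORD
}

fn main() {
    assert_eq!(words_needed(32), 1);
    // 33 entries need a second, mostly empty word: Vec::len() would be 2,
    // while the entry count is 33.
    assert_eq!(words_needed(33), 2);
}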

@michaelwoerister (Collaborator, Author)

Closing in favor of the much simpler #78.

Linked issue (may be closed by this pull request): Inline PollState in future when possible