SSE2 patches for encoding and decoding functions #302
Could we just `break` here actually? Does it affect performance? I imagine we could compare a whole 128-bit register with a broadcasted `0xFF80`, and either `break` or `_mm_packus_epi16` + `_mm_storel_epi64`.
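For illustration, the whole-register variant suggested above could look roughly like the sketch below. It is not the code from this PR: the function name `narrow_ascii_prefix` and the assumption that the input is a buffer of UTF-16 code units being narrowed to bytes are hypothetical, chosen only to show the `0xFF80` test followed by `_mm_packus_epi16` + `_mm_storel_epi64`.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from the PR: narrows 16-bit code units to
 * bytes, 8 units per iteration, and stops at the first 8-unit block that
 * contains a unit >= 0x80. Returns the number of units converted. */
static size_t narrow_ascii_prefix(const uint16_t *src, size_t len, uint8_t *dst)
{
    /* 0xFF80 in every 16-bit lane: the bits that must be zero for ASCII. */
    const __m128i high_bits = _mm_set1_epi16((short)0xFF80);
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;

    while (i + 8 <= len) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i bad = _mm_and_si128(v, high_bits);

        /* All 16 mask bits are set only if every lane of `bad` is zero. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi16(bad, zero)) != 0xFFFF)
            break;

        /* Pack the 8 lanes to 8 bytes and store the low 64 bits. */
        _mm_storel_epi64((__m128i *)(dst + i), _mm_packus_epi16(v, v));
        i += 8;
    }
    return i;
}
```

The caller would handle the remaining `len - i` units with the scalar path. Whether this beats the per-pair checks depends on the input data and the CPU, as discussed in the replies.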
Yeah, I experimented with where exactly the `break` can go and after how many checks, and chose one of the best-performing combinations. Unfortunately I didn't record those combinations, and the results might be different on different CPUs.
I mean, it looks asymmetric to exercise each pair of bytes from the first `uint64`, but not in the second one. I understand that doing the same for the second `uint64` makes the code more hairy, so maybe we can stop doing it for the first one as well? It would also allow us to avoid the `union` stuff.
I tried both symmetric variants and found both to be slower than this one.
My understanding is that
This probability is different for different input data; the extreme cases would be pure ASCII (the SIMD routine always works) and pure Chinese text (the SIMD routine never works). On our test data I observed that inspecting bytes in the first half was a sweet spot.
here you can still pack and store the whole 0'th half using the `pext` instruction: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_pext_u64&expand=4330
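To make the `pext` idea concrete, here is a portable sketch of what the instruction computes when it packs one 64-bit half. The names `pext64_soft` and `pack_four_units` are hypothetical, and the software loop only emulates the bit extraction; real code on a BMI2-capable CPU would call `_pext_u64` from `<immintrin.h>` instead. It also assumes the half holds four 16-bit units already known to fit in one byte each.

```c
#include <stdint.h>

/* Portable emulation of parallel bit extract (what _pext_u64 computes):
 * the bits of `value` selected by `mask` are gathered into the low bits
 * of the result, preserving their order. */
static uint64_t pext64_soft(uint64_t value, uint64_t mask)
{
    uint64_t result = 0;
    unsigned out = 0;

    while (mask != 0) {
        uint64_t lowest = mask & (~mask + 1); /* isolate lowest set mask bit */
        if (value & lowest)
            result |= (uint64_t)1 << out;
        out++;
        mask &= mask - 1; /* clear that mask bit */
    }
    return result;
}

/* Hypothetical helper: keeps the low byte of each of the four 16-bit
 * units held in `units`, yielding four packed bytes. */
static uint32_t pack_four_units(uint64_t units)
{
    return (uint32_t)pext64_soft(units, 0x00FF00FF00FF00FFULL);
}
```

With the hardware instruction this is a single operation per half, but it needs BMI2, which is the compatibility concern raised in the reply.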
This PR intentionally uses only SSE2 to be compatible with any 64-bit x86 CPU. Quite a few models don't support `pext`, or their implementation is not fast. From Wikipedia: