-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize StateMachine::ProcessString #15498
Conversation
I'm so friggin excited |
64942c1
to
fc6bc8d
Compare
fc6bc8d
to
59400e2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I trust it!
|
||
for (const auto end = data + (count & ~size_t{ 7 }); it < end; it += 8) | ||
{ | ||
const auto wch = _mm_loadu_si128(reinterpret_cast<const __m128i*>(it)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loadu is "unaligned" right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah exactly. There used to be a performance advantage to the aligned load ops, but nowadays unaligned loads on aligned data perform exactly like aligned loads on aligned data, which is why barely anyone uses the non-u variants now. (But we have to use unaligned loads anyways because our data isn't aligned.)
_BitScanForward(&offset, mask); | ||
it += offset / 2; | ||
return it - data; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefix:
_m_
: MMX_mm_
: MMX / SSE / SSE2_mm256_
: AVX / AVX2
suffix:
epi
: signed integerepu
: unsigned integerepi8
: component size = 8 bitepi16
: component size = 16 bitsi128
: component size = all 128 bit
component = in which granularity the operations are executed. byte-wise comparisons? wchar_t-wise comparisons? etc.
name:
loadu
: load 128 bits from pointer, unalignedsetzero
: 128 bits, all zerosubs
: subtraction with saturation (max(0, a - b)
)cmpeq
: compare equalsor
: bitwise ormovemask
: take the lowest bit of each component and store it in a singleint
a SSE vector like 0xFFFF00FF00FF0000FFFF00FF00FF0000 results in1101010011010100
combined withcmpeq
and_BitScanForward
it lets you find the first component where the comparison was successful
{ | ||
goto exitWithMask; | ||
} | ||
it += 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v[c]name[q][_n]_type
v
: vector operationc
: component-wise operationq
: quad double words = 128 bit operation_n
: scalar instead of vector arguments- name:
ld1
: load bytes (unaligned by default)dup
: duplicate argument into all componentssub
: subtractle
: lower or equalorr
: bitwise or
- type:
s
: signedu
: unsignedu8
: 8 bits per componentu16
: 16 bits per component
vgetq_lane_u64
: extracts the upper/lower 64 bits of the 128 bit vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Dustin helped me understand it better, so thanks a lot to both of you for all the help!
// It's written like this to get MSVC to emit optimal assembly for findActionableFromGround. | ||
// It lacks the ability to turn boolean operators into binary operations and also happens | ||
// to fail to optimize the printable-ASCII range check into a subtraction & comparison. | ||
return (wch <= 0x1f) | (static_cast<wchar_t>(wch - 0x7f) <= 0x20); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (wch <= 0x1f) | (static_cast<wchar_t>(wch - 0x7f) <= 0x20); | |
return (wch <= AsciiChars::US) | (static_cast<wchar_t>(wch - 0x7f) <= AsciiChars::SPC); |
if possible. Helps with readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, eh, take this with a grain of salt. The existing comments are pretty good, actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah the code is really cryptic but I think exactly because of that using these magic numbers is better. I can't really explain it... It's like... If you go cryptic, you should go full cryptic? For instance the 0x20
can't really be replaced with SPC
because it's only 0x20 because that's the result of 0x9f-0x7f
. 😅
I think the method comment
// Returns true for C0 characters and C1 [single-character] CSI.
is the most important part. While one has to look up what C0 and C1 characters are, it's self-explanatory after doing so (for instance on Wikipedia).
src/terminal/parser/stateMachine.cpp
Outdated
|
||
// Check for (wch < 0x20) | ||
auto a = _mm_subs_epu16(wch, _mm_set1_epi16(0x1f)); | ||
// Check for "(wch >= 0x7f && wch <= 0x9f)" by adding 0x10000-0x7f, which overflows to a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: Replace (wch >= 0x7f && wch <= 0x9f)
with (wch - 0x7f) <= 0x20
for consistency and because it makes more sense with the 0xff81.
The added explicit vectorization allows us to skip plain text faster
and pass it immediately to the deeper
TextBuffer
parts.Performance of printing enwik8.txt at the following block sizes:
4KiB (printf): 54MB/s -> 58MB/s
128KiB (cat): 103MB/s -> 116MB/s
Validation Steps Performed