Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize StateMachine::ProcessString #15498

Merged
merged 3 commits into from
Jun 30, 2023

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Jun 2, 2023

The added explicit vectorization allows us to skip plain text faster
and pass it immediately to the deeper TextBuffer parts.

Performance of printing enwik8.txt at the following block sizes:
4KiB (printf): 54MB/s -> 58MB/s
128KiB (cat): 103MB/s -> 116MB/s

Validation Steps Performed

  • Works on x64 ✅
  • Works on ARM ✅

@lhecker lhecker added Product-Conhost For issues in the Console codebase Area-Performance Performance-related issue labels Jun 2, 2023
@zadjii
Copy link

zadjii commented Jun 2, 2023

I'm so friggin excited

Base automatically changed from dev/lhecker/vt-perf1 to main June 15, 2023 15:34
@lhecker lhecker force-pushed the dev/lhecker/vt-perf2 branch 2 times, most recently from 64942c1 to fc6bc8d Compare June 28, 2023 15:38
@lhecker lhecker marked this pull request as ready for review June 28, 2023 19:38
Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I trust it!


for (const auto end = data + (count & ~size_t{ 7 }); it < end; it += 8)
{
const auto wch = _mm_loadu_si128(reinterpret_cast<const __m128i*>(it));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadu is "unaligned" right?

Copy link
Member Author

@lhecker lhecker Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah exactly. There used to be a performance advantage to the aligned load ops, but nowadays unaligned loads on aligned data perform exactly like aligned loads on aligned data, which is why barely anyone uses the non-u variants now. (But we have to use unaligned loads anyways because our data isn't aligned.)

_BitScanForward(&offset, mask);
it += offset / 2;
return it - data;
}
Copy link
Member Author

@lhecker lhecker Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefix:

  • _m_: MMX
  • _mm_: MMX / SSE / SSE2
  • _mm256_: AVX / AVX2

suffix:

  • epi: signed integer
  • epu: unsigned integer
  • epi8: component size = 8 bit
  • epi16: component size = 16 bit
  • si128: component size = all 128 bit
    component = in which granularity the operations are executed. byte-wise comparisons? wchar_t-wise comparisons? etc.

name:

  • loadu: load 128 bits from pointer, unaligned
  • setzero: 128 bits, all zero
  • subs: subtraction with saturation (max(0, a - b))
  • cmpeq: compare equals
  • or: bitwise or
  • movemask: take the lowest bit of each component and store it in a single int
    a SSE vector like 0xFFFF00FF00FF0000FFFF00FF00FF0000 results in 1101010011010100
    combined with cmpeq and _BitScanForward it lets you find the first component where the comparison was successful

{
goto exitWithMask;
}
it += 4;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v[c]name[q][_n]_type

  • v: vector operation
  • c: component-wise operation
  • q: quad double words = 128 bit operation
  • _n: scalar instead of vector arguments
  • name:
    • ld1: load bytes (unaligned by default)
    • dup: duplicate argument into all components
    • sub: subtract
    • le: lower or equal
    • orr: bitwise or
  • type:
    • s: signed
    • u: unsigned
    • u8: 8 bits per component
    • u16: 16 bits per component

vgetq_lane_u64: extracts the upper/lower 64 bits of the 128 bit vector

Copy link
Member

@carlos-zamora carlos-zamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Dustin helped me understand it better, so thanks a lot to both of you for all the help!

// It's written like this to get MSVC to emit optimal assembly for findActionableFromGround.
// It lacks the ability to turn boolean operators into binary operations and also happens
// to fail to optimize the printable-ASCII range check into a subtraction & comparison.
return (wch <= 0x1f) | (static_cast<wchar_t>(wch - 0x7f) <= 0x20);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (wch <= 0x1f) | (static_cast<wchar_t>(wch - 0x7f) <= 0x20);
return (wch <= AsciiChars::US) | (static_cast<wchar_t>(wch - 0x7f) <= AsciiChars::SPC);

if possible. Helps with readability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, eh, take this with a grain of salt. The existing comments are pretty good, actually.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the code is really cryptic but I think exactly because of that using these magic numbers is better. I can't really explain it... It's like... If you go cryptic, you should go full cryptic? For instance the 0x20 can't really be replaced with SPC because it's only 0x20 because that's the result of 0x9f-0x7f. 😅
I think the method comment

// Returns true for C0 characters and C1 [single-character] CSI.

is the most important part. While one has to look up what C0 and C1 characters are, it's self-explanatory after doing so (for instance on Wikipedia).


// Check for (wch < 0x20)
auto a = _mm_subs_epu16(wch, _mm_set1_epi16(0x1f));
// Check for "(wch >= 0x7f && wch <= 0x9f)" by adding 0x10000-0x7f, which overflows to a
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: Replace (wch >= 0x7f && wch <= 0x9f) with (wch - 0x7f) <= 0x20 for consistency and because it makes more sense with the 0xff81.

@lhecker lhecker added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jun 30, 2023
@microsoft-github-policy-service microsoft-github-policy-service bot deleted the dev/lhecker/vt-perf2 branch June 30, 2023 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Performance Performance-related issue AutoMerge Marked for automatic merge by the bot when requirements are met Product-Conhost For issues in the Console codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants