New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Vectorize StateMachine::ProcessString #15498

Merged

microsoft-github-policy-service merged 3 commits into main from dev/lhecker/vt-perf2

Jun 30, 2023

Member

lhecker commented Jun 2, 2023 •

edited

Loading

The added explicit vectorization allows us to skip plain text faster
and pass it immediately to the deeper TextBuffer parts.

Performance of printing enwik8.txt at the following block sizes:
4KiB (printf): 54MB/s -> 58MB/s
128KiB (cat): 103MB/s -> 116MB/s

Validation Steps Performed

Works on x64 ✅
Works on ARM ✅

lhecker added Product-Conhost Area-Performance labels

zadjii commented Jun 2, 2023

I'm so friggin excited

lhecker mentioned this pull request

Vectorize ROW initialization #15501

Merged

Base automatically changed from dev/lhecker/vt-perf1 to main

June 15, 2023 15:34

lhecker force-pushed the dev/lhecker/vt-perf2 branch 2 times, most recently from 64942c1 to fc6bc8d Compare

June 28, 2023 15:38


          Vectorize StateMachine::ProcessString

59400e2

lhecker force-pushed the dev/lhecker/vt-perf2 branch from fc6bc8d to 59400e2 Compare

June 28, 2023 19:35

lhecker marked this pull request as ready for review

June 28, 2023 19:38

DHowett approved these changes

View reviewed changes

Member

DHowett left a comment

I think I trust it!

DHowett reviewed

View reviewed changes

src/terminal/parser/stateMachine.cpp

+                  for (const auto end = data + (count & ~size_t{ 7 }); it < end; it += 8)
+                  {
+                      const auto wch = _mm_loadu_si128(reinterpret_cast<const __m128i*>(it));

Member

DHowett Jun 28, 2023

loadu is "unaligned" right?

Member Author

lhecker Jun 28, 2023 •

edited

Loading

Yeah exactly. There used to be a performance advantage to the aligned load ops, but nowadays unaligned loads on aligned data perform exactly like aligned loads on aligned data, which is why barely anyone uses the non-u variants now. (But we have to use unaligned loads anyways because our data isn't aligned.)

lhecker commented

View reviewed changes

src/terminal/parser/stateMachine.cpp

+                          _BitScanForward(&offset, mask);
+                          it += offset / 2;
+                          return it - data;
+                      }

Member Author

lhecker Jun 28, 2023 •

edited

Loading

prefix:

_m_: MMX
_mm_: MMX / SSE / SSE2
_mm256_: AVX / AVX2

suffix:

epi: signed integer
epu: unsigned integer
epi8: component size = 8 bit
epi16: component size = 16 bit
si128: component size = all 128 bit
component = in which granularity the operations are executed. byte-wise comparisons? wchar_t-wise comparisons? etc.

name:

loadu: load 128 bits from pointer, unaligned
setzero: 128 bits, all zero
subs: subtraction with saturation (max(0, a - b))
cmpeq: compare equals
or: bitwise or
movemask: take the lowest bit of each component and store it in a single int
a SSE vector like 0xFFFF00FF00FF0000FFFF00FF00FF0000 results in 1101010011010100
combined with cmpeq and _BitScanForward it lets you find the first component where the comparison was successful

src/terminal/parser/stateMachine.cpp

+                      {
+                          goto exitWithMask;
+                      }
+                      it += 4;

Member Author

lhecker Jun 28, 2023

v[c]name[q][_n]_type

v: vector operation
c: component-wise operation
q: quad double words = 128 bit operation
_n: scalar instead of vector arguments
name:
- ld1: load bytes (unaligned by default)
- dup: duplicate argument into all components
- sub: subtract
- le: lower or equal
- orr: bitwise or
type:
- s: signed
- u: unsigned
- u8: 8 bits per component
- u16: 16 bits per component

vgetq_lane_u64: extracts the upper/lower 64 bits of the 128 bit vector


          Fix AuditMode failures

4bc3f03

carlos-zamora approved these changes

View reviewed changes

Member

carlos-zamora left a comment

Looks great! Dustin helped me understand it better, so thanks a lot to both of you for all the help!

src/terminal/parser/stateMachine.cpp

+                  // It's written like this to get MSVC to emit optimal assembly for findActionableFromGround.
+                  // It lacks the ability to turn boolean operators into binary operations and also happens
+                  // to fail to optimize the printable-ASCII range check into a subtraction & comparison.
+                  return (wch <= 0x1f) | (static_cast<wchar_t>(wch - 0x7f) <= 0x20);

Member

carlos-zamora Jun 29, 2023

Suggested change

      
                return (wch <= 0x1f) | (static_cast<wchar_t>(wch - 0x7f) <= 0x20);
          
                return (wch <= AsciiChars::US) | (static_cast<wchar_t>(wch - 0x7f) <= AsciiChars::SPC);

if possible. Helps with readability.

Member

carlos-zamora Jun 29, 2023

oh, eh, take this with a grain of salt. The existing comments are pretty good, actually.

Member Author

lhecker Jun 30, 2023

Yeah the code is really cryptic but I think exactly because of that using these magic numbers is better. I can't really explain it... It's like... If you go cryptic, you should go full cryptic? For instance the 0x20 can't really be replaced with SPC because it's only 0x20 because that's the result of 0x9f-0x7f. 😅
I think the method comment

// Returns true for C0 characters and C1 [single-character] CSI.

is the most important part. While one has to look up what C0 and C1 characters are, it's self-explanatory after doing so (for instance on Wikipedia).

lhecker commented

View reviewed changes

src/terminal/parser/stateMachine.cpp Outdated

+                      // Check for (wch < 0x20)
+                      auto a = _mm_subs_epu16(wch, _mm_set1_epi16(0x1f));
+                      // Check for "(wch >= 0x7f && wch <= 0x9f)" by adding 0x10000-0x7f, which overflows to a

Member Author

lhecker Jun 30, 2023

Note to self: Replace (wch >= 0x7f && wch <= 0x9f) with (wch - 0x7f) <= 0x20 for consistency and because it makes more sense with the 0xff81.


          Fix comment, Fix operand size

380cb99

lhecker added the AutoMerge label

microsoft-github-policy-service bot enabled auto-merge (squash)

June 30, 2023 13:36

microsoft-github-policy-service bot merged commit c22d9b1 into main

microsoft-github-policy-service bot deleted the dev/lhecker/vt-perf2 branch

June 30, 2023 14:11

zadjii-msft mentioned this pull request

Very slow rendering of colored text #4129

Closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-Performance AutoMerge Product-Conhost