Optimize hot path in textBufferCellIterator #10621

skyline75489 · 2021-07-12T03:16:28Z

Summary of the Pull Request

References

The += operator is an extremely hot path under heavy output load. This PR aims to optimize its speed.

PR Checklist

Supports [Performance] vtebench tracking issue #10563
CLA signed. If not, go over here and sign the CLA
Tests added/passed
Documentation updated. If checked, please file a pull request on our docs repo and link it here: #xxx
Schema updated.
I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

Validation Steps Performed

skyline75489 · 2021-07-12T03:20:56Z

This PR only drops the time needed for "1 million yes" from ~8.0s to 7.0s, even 6.6s if it's in a good mood. The CPU usage of += drops from 30%-40% down to 8%(!).

Note: this is a very x64-specific optimization, if not AMD Ryzen-specific. The reason this PR works is that it reduces the cache pressure on the CPU, which dramatically improves performance. This work is once again inspired by the one and only @lhecker 😃

OpenConsole.exe trace

Before (pay attention to both operator += and _GenerateView):

After:

skyline75489 · 2021-07-15T03:40:04Z

src/buffer/out/textBufferCellIterator.cpp

@@ -95,18 +95,66 @@ bool TextBufferCellIterator::operator!=(const TextBufferCellIterator& it) const
 TextBufferCellIterator& TextBufferCellIterator::operator+=(const ptrdiff_t& movement)
 {
    ptrdiff_t move = movement;
-    auto newPos = _pos;
-    while (move > 0 && !_exceeded)
+    if (move < 0)


Let's say we are doing the "1 million yes" benchmark with a typical 120x80 window. The += operator will execute 120 x 1,000,000 = 120,000,000 times. In this particular method, I think it's safe to say that every single instruction counts.

First, this line is an early branch that hands the very rare case (move < 0) off to the -= operator. Next, I replace IncrementInBounds with a simplified, cache-friendly version, without compromising functionality. Finally, I replace _GenerateView with fine-grained Update methods.
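For readers without the full diff at hand, the three steps can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: CellCursor, its fields, and the rowLoads counter are hypothetical stand-ins, and the real iterator also tracks the row pointer, the attribute iterator, and the view.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the iterator; NOT the real TextBufferCellIterator.
struct CellCursor
{
    int x = 0;         // current column
    int y = 0;         // current row
    int width = 120;   // columns per row (assumed bounds)
    int height = 80;   // rows (assumed bounds)
    bool exceeded = false;
    int rowLoads = 0;  // counts how often the "expensive" per-row work runs

    // Rare path: a plain, unoptimized walk backwards.
    CellCursor& operator-=(std::ptrdiff_t movement)
    {
        assert(movement >= 0);
        while (movement > 0 && !exceeded)
        {
            if (x > 0) { --x; }
            else if (y > 0) { --y; x = width - 1; ++rowLoads; }
            else { exceeded = true; }
            --movement;
        }
        return *this;
    }

    CellCursor& operator+=(std::ptrdiff_t movement)
    {
        // 1. Branch out early so the hot loop never deals with negative moves.
        if (movement < 0)
        {
            return *this -= -movement;
        }
        // 2. Simplified in-bounds increment: the common case touches only x.
        while (movement > 0 && !exceeded)
        {
            if (x + 1 < width) { ++x; }
            else if (y + 1 < height) { x = 0; ++y; ++rowLoads; } // 3. per-row state is refreshed only when the row changes
            else { exceeded = true; }
            --movement;
        }
        return *this;
    }
};
```

In this sketch, advancing 119 cells from the origin never triggers the per-row work; only the 120th step does. Whether that translates into the cache behavior described above depends on the real data layout and has to be measured.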

if _GenerateView is that bad... I feel like we should replace all usages of it, not just this one += usage of it.

Same goes for IncrementInBounds and DecrementInbounds.

I'm OK with it being more complex inside this operator specifically as long as it's well documented why things are ordered and computed in the way they are (like specifically calling out which items below are taking advantage of cache improvements so no one tries to refactor it out to be "easier to read" but significantly worse performance in the future.)

Besides the "early optimization is the root of all evil" cliche, I'd like to point out that _GenerateView, IncrementInBounds and DecrementInbounds are not actually expensive methods in terms of CPU costs.

Take a look at my "before" screenshot and you will see WalkInBoundsCircular (which is essentially what IncrementInBounds calls) is only 1.35% of the CPU time. And that number is in fact accurate, because that method itself isn't responsible for the performance drop. It's the unfortunate combination of a whole lot of other methods in the hot paths that ruins the icache (instruction cache) or dcache (data cache) at some particular point.

Another example: I've actually tried to remove the code inside this method step by step. At some point, even a single line like _pos = newPos can show up as >7% in the trace. That is obviously not an expensive operation, but it triggers a cache miss.

In conclusion, it takes accurate benchmarking to "port" this optimization somewhere else. And the optimization may very well be NOT necessary in other code paths.

I got the inspiration from @lhecker and he might want to add his opinions here, too. I'll try to add more comments as documentation to prevent "easier to read" refactoring in the future.

Great justification. I'm fine with you not optimizing the other ones. Just call out some of this in the comments in the code and I'm sold.

miniksa · 2021-07-15T21:16:28Z

Sheesh. I had thought the only way of improving this one was going to be a multi-pass operation where we pre-pass some of the data to identify things that were uncomplicated and then bulk fill them into the destination via memcpy instead of handling them like this one-by-one. Perhaps that's still in the future cards, but I'm glad to see this method improve.
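The multi-pass idea mentioned here is not part of this PR. As a rough sketch of what it could look like, the function below scans for runs of "uncomplicated" cells and copies each run with a single memcpy, falling back to one-by-one handling otherwise. FillRow, IsSimple, and the "printable ASCII" criterion are assumptions made up for illustration; the real buffer stores more than one wchar_t per cell.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Assumption for illustration: a cell is "uncomplicated" if it is printable ASCII.
inline bool IsSimple(wchar_t ch) noexcept
{
    return ch >= 0x20 && ch < 0x7f;
}

// Copies `text` into `row` starting at column 0.
// Returns how many cells went through the bulk path.
inline std::size_t FillRow(std::wstring_view text, std::vector<wchar_t>& row)
{
    assert(text.size() <= row.size());
    std::size_t bulk = 0;
    std::size_t i = 0;
    while (i < text.size())
    {
        // Pre-pass: measure the run of simple cells starting at i.
        std::size_t end = i;
        while (end < text.size() && IsSimple(text[end]))
        {
            ++end;
        }
        if (end > i)
        {
            // Bulk fill: copy the whole run at once.
            std::memcpy(row.data() + i, text.data() + i, (end - i) * sizeof(wchar_t));
            bulk += end - i;
            i = end;
        }
        else
        {
            // Slow path: handle the complicated cell on its own.
            row[i] = text[i];
            ++i;
        }
    }
    return bulk;
}
```

For an input such as L"ab\tcd", the two runs "ab" and "cd" take the bulk path and the tab takes the slow path.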

miniksa · 2021-07-15T21:21:24Z

src/buffer/out/OutputCellView.hpp

@@ -36,6 +36,21 @@ class OutputCellView
    TextAttribute TextAttr() const noexcept;
    TextAttributeBehavior TextAttrBehavior() const noexcept;

+    void UpdateText(const std::wstring_view& view) noexcept


I'm not sure I understand why calling each of these bits separately is more performant than replacing them all at once via a new-construction. It's also sort of confusing why Behavior is left out, but the other 3 properties can be changed.

I don't like exposing these methods, either. I love the idea of a read-only view. Here's my understanding of why this helps.

Before this PR, the code looks like this:

    const auto diff = gsl::narrow_cast<ptrdiff_t>(newPos.X) - gsl::narrow_cast<ptrdiff_t>(_pos.X);
    _attrIter += diff; // Here we know _attrIter is hot and in cache.
    //
    // ...Some code here.
    _view = OutputCellView(_pRow->GetCharRow().GlyphAt(_pos.X),
                           _pRow->GetCharRow().DbcsAttrAt(_pos.X),
                           // After all the _pRow operations, _attrIter may not be in cache anymore.
                           *_attrIter,
                           TextAttributeBehavior::Stored);

After this PR:

    const auto diff = gsl::narrow_cast<ptrdiff_t>(newX) - gsl::narrow_cast<ptrdiff_t>(oldX);
    _attrIter += diff;
    // We know for sure _attrIter is hot. And after this point, it can be safely discarded.
    _view.UpdateTextAttribute(*_attrIter);

This is why I think the fine-grained UpdateSomething helps the performance.
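To make the comparison concrete, here is a small self-contained sketch of the two styles. CellView and Attr are simplified, hypothetical stand-ins for OutputCellView and TextAttribute, and RebuildView and UpdateViewInPlace are names invented for this example. The sketch only shows that both styles leave the view in the same state; it says nothing about cache behavior, which has to be measured on the real code.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical, simplified stand-ins; the real OutputCellView and TextAttribute are richer.
struct Attr
{
    std::uint16_t legacy = 0;
};

inline bool operator==(const Attr& a, const Attr& b) noexcept
{
    return a.legacy == b.legacy;
}

class CellView
{
public:
    CellView(std::wstring_view text, Attr attr) noexcept :
        _text{ text }, _attr{ attr } {}

    std::wstring_view Chars() const noexcept { return _text; }
    Attr TextAttr() const noexcept { return _attr; }

    // Fine-grained setters: each one touches exactly one member.
    void UpdateText(std::wstring_view text) noexcept { _text = text; }
    void UpdateTextAttribute(const Attr& attr) noexcept { _attr = attr; }

private:
    std::wstring_view _text;
    Attr _attr;
};

// "Before" style: gather every input first, then rebuild the whole view.
inline void RebuildView(CellView& view, std::wstring_view glyph, const Attr* attrIter) noexcept
{
    view = CellView(glyph, *attrIter);
}

// "After" style: consume the attribute right away, then update the text separately.
inline void UpdateViewInPlace(CellView& view, std::wstring_view glyph, const Attr* attrIter) noexcept
{
    view.UpdateTextAttribute(*attrIter);
    view.UpdateText(glyph);
}
```

The design point is ordering: with separate setters, the attribute can be read immediately after _attrIter is advanced, instead of being held until all row lookups have finished.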

I'd love to see how @lhecker sees this in the assembly view, which would be more accurate. I for one can't read assembly that well yet.

> why Behavior is left out, but the other 3 properties can be changed.

Well, it's because I don't need Behavior in this PR 😅

miniksa · 2021-07-15T21:23:45Z

src/buffer/out/textBufferCellIterator.cpp

@@ -95,18 +95,66 @@ bool TextBufferCellIterator::operator!=(const TextBufferCellIterator& it) const
 TextBufferCellIterator& TextBufferCellIterator::operator+=(const ptrdiff_t& movement)
 {
    ptrdiff_t move = movement;
-    auto newPos = _pos;
-    while (move > 0 && !_exceeded)
+    if (move < 0)


if _GenerateView is that bad... I feel like we should replace all usages of it, not just this one += usage of it.

Same goes for IncrementInBounds and DecrementInbounds.

I'm OK with it being more complex inside this operator specifically as long as it's well documented why things are ordered and computed in the way they are (like specifically calling out which items below are taking advantage of cache improvements so no one tries to refactor it out to be "easier to read" but significantly worse performance in the future.)

lhecker · 2021-07-16T20:47:55Z

I tested these changes out and they work very well. Almost unreasonably well.
This PR drops the time to print big.txt by over 30% down below 1s in WSL2. I mean... wow!

As much of a hack as this code is, I'm 100% in favor of merging it before we ultimately rewrite the console buffer anyway. But I'd prefer to retain _SetPos for operator-=, as I believe that operator-= isn't used that much. If we do that, we can clearly mark the "hacky" code as being an "inlined" version of _SetPos, which should make the intent clear enough that this can be merged.

skyline75489 · 2021-07-18T11:05:06Z

The ApiGetConsoleOriginalTitleA failure seems to be transient.

DHowett · 2021-07-19T19:06:33Z

> The ApiGetConsoleOriginalTitleA failure seems to be transient.

It's absolutely wild that you should say this today. On the same day, you hit a random ApiGetConsoleOriginalTitleA test failure and the OS build hit the same failure and filed a blocking bug on me.

The same day.

We traced it back to an issue in #8621.

skyline75489 · 2021-07-19T21:12:38Z

Now that you mention it, I do remember seeing the same failure several times, both locally and in the CI pipeline. What do you know. The failure loves me 😃

DHowett · 2021-07-19T22:47:11Z

src/buffer/out/OutputCellView.hpp

@@ -6,7 +6,7 @@ Module Name:
 - OutputCellView.hpp

 Abstract:
- Read-only view into a single cell of data that someone is attempting to write into the output buffer.
+- Read view into a single cell of data that someone is attempting to write into the output buffer.


It's.. not read-only any more? You can write through it?

I mean, you can now UpdateSomething in it, which disqualifies the "read-only" part, right?

Update: ah I now see what you mean. No, you can't write through it to the real buffer. It's still only a "view" of the buffer. But now you can update properties in it, so I don't think it's "read-only" anymore.

Also, I've tried various ways to keep the original _GenerateView (and preserve the read-only-ness of OutputCellView), but I failed to find a way to achieve the same level of performance as the current implementation. I've discussed this with @lhecker, and he seems to concur with the solution.

ghost · 2021-07-27T14:23:31Z

Hello @lhecker!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (`@msftbot`) and give me an instruction to get started! Learn more here.

ghost · 2021-08-31T17:09:42Z

🎉 Windows Terminal Preview v1.11.2421.0 has been released which incorporates this pull request. 🎉

Optimize hot path in textBufferCellIterator

3f67edf

zadjii-msft added the Area-Performance (Performance-related issue) and Issue-Task (It's a feature request, but it doesn't really need a major design.) labels Jul 12, 2021

skyline75489 added 4 commits July 14, 2021 21:24

Merge branch 'main' into chesterliu/dev/tbci-hotpath

8aadccb

Update

7feea77

Restore functionailty

737fe3c

Format

f04c94e

skyline75489 marked this pull request as ready for review July 15, 2021 03:15

skyline75489 commented Jul 15, 2021

miniksa reviewed Jul 15, 2021

review

53cfb27

U got me, bot

7ed2a76

Merge branch 'main' into chesterliu/dev/tbci-hotpath

050dc7e

DHowett reviewed Jul 19, 2021

skyline75489 mentioned this pull request Jul 20, 2021

[Performance] vtebench tracking issue #10563

Open

miniksa approved these changes Jul 20, 2021

Minor code cleanup

8753573

Fix spelling

2a12085

lhecker approved these changes Jul 27, 2021

lhecker added the AutoMerge (Marked for automatic merge by the bot when requirements are met) label Jul 27, 2021

ghost merged commit 37e0614 into microsoft:main Jul 27, 2021

skyline75489 deleted the chesterliu/dev/tbci-hotpath branch July 27, 2021 23:43

skyline75489 mentioned this pull request Jul 28, 2021

Optimize hot path in OutputCellIterator #10811

Closed
