-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize hot path in textBufferCellIterator #10621
Conversation
This PR only drops the time needed for "1 million yes" from ~8.0s to 7.0s, even 6.6s if it's in a good mood. The CPU usage of Note: this is very x64 specific optimization, if not AMD Ryzen specific. The reason why this PR works, is that it reduces the cache pressure on the CPU, which dramatically improves the performance. This work is once again inspired by the one and only @lhecker 😃 OpenConsole.exe trace Before (pay attention to both After: |
@@ -95,18 +95,66 @@ bool TextBufferCellIterator::operator!=(const TextBufferCellIterator& it) const | |||
TextBufferCellIterator& TextBufferCellIterator::operator+=(const ptrdiff_t& movement) | |||
{ | |||
ptrdiff_t move = movement; | |||
auto newPos = _pos; | |||
while (move > 0 && !_exceeded) | |||
if (move < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's say we are doing the "1 million yes" benchmark with a typical 120x80 windows. The +=
operator will execute 120 x 1000000 = 120000000 times. In this particular method, I think it's safe to say that every single instruction counts.
First, this line here is the early branching to avoid the very rare case (move < 0) and leave it to -=
operator. Next I replace the IncrementInBounds
with an simplified, cache-friendly version, without compromising functionally. And finally I replace the _GenerateView
with fine-grained Update
methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _GenerateView
is that bad... I feel like we should replace all usages of it, not just this one +=
usage of it.
Same goes for IncrementInBounds
and DecrementInbounds
.
I'm OK with it being more complex inside this operator specifically as long as it's well documented why things are ordered and computed in the way they are (like specifically calling out which items below are taking advantage of cache improvements so no one tries to refactor it out to be "easier to read" but significantly worse performance in the future.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the "early optimization is the root of all evil" cliche, I'd like to point out that _GenerateView
, IncrementInBounds
and DecrementInbounds
are not actually expensive methods in terms of CPU costs.
Take a look at my "before" screenshot and you will see WalkInBoundsCircular
(which is essentially what IncrementInBounds
calls) is only 1.35% of the CPU time. And it's in fact accurate, because that method itself isn't guilty for the performance drop. It's the unfortune combination of a whole lot of other methods in the hot paths that ruins the icache (instruction cache) or dcache (data cache) at some particular point.
Another example: I've actually tried to remove the code inside this method step by step. At some point, even a single line like _pos = newPos
can show up >7% in the trace, which is obviously not a really expensive operation, but triggers the cache miss.
In conclusion, it takes accurate benchmarking to "port" this optimization somewhere else. And the optimization may very well be NOT necessary in other code paths.
I got the inspiration from @lhecker and he might want to add his opinions here, too. I'll try to add more comments as documentation to prevent "easier to read" refactoring in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great justification. I'm fine with you not optimizing the other ones. Just call out some of this in the comments in the code and I'm sold.
Sheesh. I had thought the only way of improving this one was going to be a multi-pass operation where we pre-pass some of the data to identify things that were uncomplicated and then bulk fill them into the destination via |
@@ -36,6 +36,21 @@ class OutputCellView | |||
TextAttribute TextAttr() const noexcept; | |||
TextAttributeBehavior TextAttrBehavior() const noexcept; | |||
|
|||
void UpdateText(const std::wstring_view& view) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why calling each of these bits separately is more performant than replacing them all at once via a new-construction. It's also sort of confusing why Behavior
is left out, but the other 3 properties can be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like exposing these methods, either. I love the idea of read-only view. Here's my understanding of why this helps.
Before this PR, the code looks like this:
const auto diff = gsl::narrow_cast<ptrdiff_t>(newPos.X) - gsl::narrow_cast<ptrdiff_t>(_pos.X);_
_attrIter += diff;
// Here we know _attrIter is hot and in cache.
//
// ...Some code here.
_view = OutputCellView(_pRow->GetCharRow().GlyphAt(_pos.X),
_pRow->GetCharRow().DbcsAttrAt(_pos.X),
// After all the _pRow operations, _attrIter may not be in cache anymore.
*_attrIter,
TextAttributeBehavior::Stored);
After this PR:
const auto diff = gsl::narrow_cast<ptrdiff_t>(newX) - gsl::narrow_cast<ptrdiff_t>(oldX);
_attrIter += diff;
// We know for sure _attrIter is hot. And after this point, it can be safely discarded.
_view.UpdateTextAttribute(*_attrIter);
This is why I think the fine-grained UpdateSomething
helps the performance.
I'd love to see how @lhecker see this in the assembly view, which would be more accurate. I for one can't read assembly that well yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why Behavior is left out, but the other 3 properties can be changed.
Well, it's because I don't need Behavior
in this PR 😅
@@ -95,18 +95,66 @@ bool TextBufferCellIterator::operator!=(const TextBufferCellIterator& it) const | |||
TextBufferCellIterator& TextBufferCellIterator::operator+=(const ptrdiff_t& movement) | |||
{ | |||
ptrdiff_t move = movement; | |||
auto newPos = _pos; | |||
while (move > 0 && !_exceeded) | |||
if (move < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if _GenerateView
is that bad... I feel like we should replace all usages of it, not just this one +=
usage of it.
Same goes for IncrementInBounds
and DecrementInbounds
.
I'm OK with it being more complex inside this operator specifically as long as it's well documented why things are ordered and computed in the way they are (like specifically calling out which items below are taking advantage of cache improvements so no one tries to refactor it out to be "easier to read" but significantly worse performance in the future.)
I tested these changes out and they work very well. Almost unreasonably well. As much of a hack this code is I'm 100% in favor of merging it before we ultimately rewrite the console buffer anyways. But I'd prefer to retain |
This comment has been minimized.
This comment has been minimized.
The |
It's absolutely wild that you should say this today. On the same day, you hit a random ApiGetConsoleOriginalTitleA test failure and the OS build hit the same failure and filed a blocking bug on me. The same day. We traced it back to an issue in #8621. |
Now that you mentioned it, I do remember seeing the same failure several times both locally and on CI pipeline. What do you know. The failure loves me 😃
获取 Outlook for iOS<https://aka.ms/o0ukef>
|
@@ -6,7 +6,7 @@ Module Name: | |||
- OutputCellView.hpp | |||
|
|||
Abstract: | |||
- Read-only view into a single cell of data that someone is attempting to write into the output buffer. | |||
- Read view into a single cell of data that someone is attempting to write into the output buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's.. not read-only any more? You can write through it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, you can now UpdateSomething in it, which disqualified the “read-only” part, right?
Update: ah I now see what you mean. No, you can't write through it to the real buffer. It's still only a "view" of the buffer. But now you can update properties in it, so I don't think it's "read-only" anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I've tried various ways to keep the original _GenerateView
(and preserve the read-only-ness in OutputCellView
), but I failed to find a way to achieve the same level of performance as the current implementation. I've discussed with @lhecker about this, and he seems to concur with the solution.
This comment has been minimized.
This comment has been minimized.
Hello @lhecker! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References The `+=` operator is an extremely hot path under heavily output load. This PR aims to optimize its speed. <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [ ] Supports #10563 * [ ] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [ ] Tests added/passed * [ ] Documentation updated. If checked, please file a pull request on [our docs repo](https://github.com/MicrosoftDocs/terminal) and link it here: #xxx * [ ] Schema updated. * [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed
🎉 Handy links: |
Summary of the Pull Request
References
The
+=
operator is an extremely hot path under heavily output load. This PR aims to optimize its speed.PR Checklist
Detailed Description of the Pull Request / Additional comments
Validation Steps Performed