Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to fix crash on close with saving enabled #11440

Merged
merged 2 commits into from
Oct 20, 2021

Conversation

Rosefield
Copy link
Contributor

@Rosefield Rosefield commented Oct 6, 2021

Don't crash if we try to save the window layout while we are closing, and try to avoid saving at all.

Might impact #11354

Detailed Description of the Pull Request / Additional comments

  • Revoke the event handler/save throttler so we don't even try to get the window layout when we are closing
  • Try to check for nullptrs, but then apply try {} CATCH_LOG() liberally

Validation Steps Performed

The happy path of saving normally is still fine, but I haven't been unlucky enough to trigger the crash myself.

@ghost ghost added Area-Settings Issues related to settings and customizability, for console or terminal Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Product-Terminal The new Windows Terminal. Severity-Blocking We won't ship a release like this! No-siree. Severity-Crash Crashes are real bad news. labels Oct 6, 2021
@@ -723,10 +738,15 @@ winrt::Windows::Foundation::IAsyncOperation<winrt::hstring> AppHost::_GetWindowL
{
winrt::apartment_context peasant_thread;

winrt::hstring layoutJson = L"";
// Use the main thread since we are accessing controls.
co_await winrt::resume_foreground(_logic.GetRoot().Dispatcher());
Copy link
Contributor Author

@Rosefield Rosefield Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this line fail? should that also go in the try {}? If it can fail what state are we even in afterwards?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot fail. At least not in a way that can be caught by try/catch (for instance if GetRoot() were to return a nullptr).

pos.X = window.left - nonClientArea.left;
pos.Y = window.top;
}
CATCH_LOG();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This try/catch seems to be redundant with the one on line 746.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It felt safer to also just put the try/catch here since this function is called in the CloseRequested and WindowCloseButtonClicked event handlers as well, and might be used elsewhere in the future.

LaunchPosition pos{};
// If we started saving before closing, but didn't resume the event handler
// until after _window might be a nullptr.
if (!_window)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is exclusively called on the main thread right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is called in a couple of event handlers that may or may not be on the main thread, so I don't know if I can guarantee it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a race condition with whatever sets this field to null then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its why there is still the try/catch, yes. I admittedly don't have a good sense of the full lifetime of the application, especially as things are closing. I believe in the error that @zadjii-msft found it would be on the main thread (assuming the ui thread is also the one that handles destruction) since it went through GetWindowLayoutsAsync.

Copy link
Contributor Author

@Rosefield Rosefield Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more clear, I believe, but have not verified that this is the order of operations in the crash

  • Event handler gets called which gives the peasant an IAsyncOperation
  • The coroutine progresses until it yields on await resume_foreground ...
  • The application starts getting closed because we called PostQuitMessage
  • The event loop in wWinMain exits so the destructor for AppHost is called
  • we set _window = nullptr and then call App->Close.
  • App->Close() tries to clear the event handlers / flush any outstanding requests which includes our event handler in (2)
  • The event handler resumes and then tries to access _window which is now a a nullptr.
  • crash

I have attempted to solve the problem in the following ways

  • Remove the event handler for WindowLayoutRequested before we call PostQuitMessage so hopefully there are no outstanding events when we call our destructor
  • If we still get into the event handler somehow, we check explicitly if the window is a nullptr so that we don't attempt to use it if it is invalid
  • If we fail later (say on usage of _logic) then we just wrap everything in a try catch and just ignore the error. I have not seen a case where this fails, but it might just be that any crash here would've crashed on the _window usage before.

I think that this is sufficiently resilient to avoiding crashes but it is possible that there is something I'm missing still in my understanding of how the runtime manages the lifetime of IAsyncOperations.

Comment on lines +404 to +406
// We don't want to try to save layouts if we are about to close
_getWindowLayoutThrottler.reset();
_windowManager.GetWindowLayoutRequested(_GetWindowLayoutRequestedToken);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a weak_ptr for the GetWindowLayoutRequested even handler instead?

Copy link
Contributor Author

@Rosefield Rosefield Oct 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that the event handler can get called during the destructor for the AppHost so technically this is still valid, but half of the state is gone. Removing the event handler makes it so that it (hopefully) can't get called at all. I don't know if revoking the event handler finishes executing outstanding asynchronous events.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of a weak_ptr would prevent this. The moment the destructor starts running, a weak_ptr cannot be promoted to a strong com_ptr anymore. So basically if you write code like this:

std::weak_ptr<T> weakSelf{ shared_from_this() };
...
if (const auto self = weakSelf.lock()) {
    ... // <-- This path will not be reached if the destructor is running.
}

Except of course you meant the issue were our WinUI/XAML classes are being Close()d long before they're destroyed... But AppHost is unrelated to WinUI, so this shouldn't apply right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 I'm not sure what to do. Currently AppHost is not std::enabled_shared_from_this, and I don't know if that is a desirable change to make. I'm happy to do whatever, but I'm not confident in my ability to appropriately test this since I haven't been able to reproduce the original exceptions myself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I'm sorry I mixed up std and winrt lingo...
With WinRT we have to use the member function get_weak(), which returns a winrt::weak_ref and to promote it to a strong reference you have to use get() instead of lock().

We can wait for input from other terminal devs. 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AppHost also does not seem to be a winrt type either so it does not have get_weak

@carlos-zamora
Copy link
Member

Tested this on my machine. The crash still occurs :(. Any way I can help?

@Rosefield
Copy link
Contributor Author

Is there a particular stack trace that you are seeing? The only one I am even remotely confident in fixing is this #11354 (comment)

I doubt this fixes any of the things to do with the settings page.

@zadjii-msft
Copy link
Member

Carlos's stack looked like the same as #11354 (comment) unfortunately 😕

@Rosefield
Copy link
Contributor Author

The only thing that stood out to me was
Windows_UI_Xaml!ctl::ComObject<DirectUI::Button>::QueryInterface
in the main call stack, but I don't know if that was related to the 5 errors about WeakReferenceImpl::Resolve.

As much as I'd like to help. and am motivated to not release bugs in my code (if it is), I don't think I have anything to offer.

…this way any concurrent event handlers will finish, and future ones will not include us.
// Remove ourself from the list of peasants so that we aren't included in
// any future requests. This will also mean we block until any existing
// event handler finishes.
_windowManager.SignalClose();
Copy link
Contributor Author

@Rosefield Rosefield Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this which makes many of the other changes potentially redundant (but it doesn't hurt to have more checks). Assuming this call doesn't fail somehow, we will try to acquire the lock in the monarch which means we block until any outstanding event handlers finish, and then afterwards we are removed from the list of peasants so we shouldn't be called again.

Edit: I remembered that I made this function do nothing if we are in the middle of a quit operation, so it isn't quite as thorough. But also before a quit happens we stop the automatic saving so it is probably fine anyways. 🤷 Enforcing all of the invariants I want is a pain.

@DHowett
Copy link
Member

DHowett commented Oct 15, 2021

The bug this is linked to is Sev-Blocking -- @zadjii-msft, do we feel comfortable trying to get this fix -- at least, something like it which makes the code more robust even if it doesn't pinpoint the crash -- into 1.12?

@Rosefield
Copy link
Contributor Author

Is there anything I need to do here?

@zadjii-msft
Copy link
Member

Sorry I've been off in a hundred different places for the last twoish weeks. Lemme try and take a closer look at this. We've got another day to iterate before the release so I think we can land this, even if we don't have the definite root cause of the crashes found.

Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in here all feel safe enough. If I was doing it I probably would have just done the if (!_window){} edit to be more surgical, but since we don't actually fully understand the root cause, I'm not sure we'd get anything from being surgical. The rest of the changes here seem to me to be good defensive things.

We could do the shared_ptr and weak_ptr thing. That does seem a tad bit like overkill, since there's only ever one AppHost (so there was never a need for a pointer to it or any weak references), but this does seem like a valid use case for it. (I similarly didn't bother with making AppHost a winrt type because again, it doesn't really need to be lifetime managed or projected in any meaningful way.)

I might sprinkle additional comments with references to GH#11354 throughout, but 🤷.

I'm cool with doing this as a "hopefully this will make the issue better" for 1.12, and keep our eye out. We'll know soon enough if it didn't work 😄

@Rosefield
Copy link
Contributor Author

To be clear, I'd feel much better if you tested the code out / made modifications that you wanted. I never like making changes that I can't really test or reproduce the original issue.

@zadjii-msft
Copy link
Member

FWIW, I'd love if I had a reproducible version of this to try with. Unfortunately it seems every time I go looking, it goes away 😢 I did build this locally and couldn't seem to hit it while I was testing it out, so that seems promising at least?

@zadjii-msft zadjii-msft added the Needs-Second It's a PR that needs another sign-off label Oct 19, 2021
@DHowett
Copy link
Member

DHowett commented Oct 19, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@DHowett DHowett added the AutoMerge Marked for automatic merge by the bot when requirements are met label Oct 19, 2021
@ghost
Copy link

ghost commented Oct 19, 2021

Hello @DHowett!

Because this pull request has the AutoMerge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@PankajBhojwani
Copy link
Contributor

Force merging despite the x64 build failure because the test failures aren't at all related to this PR

@PankajBhojwani PankajBhojwani merged commit 6bf1507 into microsoft:main Oct 20, 2021
PankajBhojwani pushed a commit that referenced this pull request Oct 20, 2021
Don't crash if we try to save the window layout while we are closing, and try to avoid saving at all.

Might impact #11354 

## Detailed Description of the Pull Request / Additional comments
- Revoke the event handler/save throttler so we don't even try to get the window layout when we are closing
- Try to check for nullptrs, but then apply `try {} CATCH_LOG()` liberally

## Validation Steps Performed
The happy path of saving normally is still fine, but I haven't been unlucky enough to trigger the crash myself.
@Rosefield Rosefield deleted the bug/dont-crash-on-close branch October 20, 2021 00:56
@@ -401,23 +401,42 @@ void AppHost::LastTabClosed(const winrt::Windows::Foundation::IInspectable& /*se
_HideNotificationIconRequested();
}

// We don't want to try to save layouts if we are about to close
_getWindowLayoutThrottler.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh jeez, this introduced a new one. I'll file it

@ghost ghost removed the Needs-Second It's a PR that needs another sign-off label Nov 4, 2021
ghost pushed a commit that referenced this pull request Dec 6, 2021
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? -->
## Summary of the Pull Request
Clean up an invalid access that I introduced in #11440 

<!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> 
## References

<!-- Please review the items on the PR checklist before submitting-->
## PR Checklist
* [x] Closes #11684 
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [ ] Tests added/passed
* [ ] Documentation updated. If checked, please file a pull request on [our docs repo](https://github.com/MicrosoftDocs/terminal) and link it here: #xxx
* [ ] Schema updated.
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

<!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here -->
## Detailed Description of the Pull Request / Additional comments

<!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well -->
## Validation Steps Performed
DHowett pushed a commit that referenced this pull request Dec 13, 2021
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? -->
## Summary of the Pull Request
Clean up an invalid access that I introduced in #11440

<!-- Other than the issue solved, is this relevant to any other issues/existing PRs? -->
## References

<!-- Please review the items on the PR checklist before submitting-->
## PR Checklist
* [x] Closes #11684
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [ ] Tests added/passed
* [ ] Documentation updated. If checked, please file a pull request on [our docs repo](https://github.com/MicrosoftDocs/terminal) and link it here: #xxx
* [ ] Schema updated.
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

<!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here -->
## Detailed Description of the Pull Request / Additional comments

<!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well -->
## Validation Steps Performed

(cherry picked from commit 29e6235)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Settings Issues related to settings and customizability, for console or terminal AutoMerge Marked for automatic merge by the bot when requirements are met Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Product-Terminal The new Windows Terminal. Severity-Blocking We won't ship a release like this! No-siree. Severity-Crash Crashes are real bad news.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants