-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to fix crash on close with saving enabled #11440
Try to fix crash on close with saving enabled #11440
Conversation
…and try to avoid saving at all
@@ -723,10 +738,15 @@ winrt::Windows::Foundation::IAsyncOperation<winrt::hstring> AppHost::_GetWindowL | |||
{ | |||
winrt::apartment_context peasant_thread; | |||
|
|||
winrt::hstring layoutJson = L""; | |||
// Use the main thread since we are accessing controls. | |||
co_await winrt::resume_foreground(_logic.GetRoot().Dispatcher()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this line fail? should that also go in the try {}
? If it can fail what state are we even in afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cannot fail. At least not in a way that can be caught by try/catch (for instance if GetRoot()
were to return a nullptr
).
pos.X = window.left - nonClientArea.left; | ||
pos.Y = window.top; | ||
} | ||
CATCH_LOG(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try/catch seems to be redundant with the one on line 746.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It felt safer to also just put the try/catch here since this function is called in the CloseRequested
and WindowCloseButtonClicked
event handlers as well, and might be used elsewhere in the future.
LaunchPosition pos{}; | ||
// If we started saving before closing, but didn't resume the event handler | ||
// until after _window might be a nullptr. | ||
if (!_window) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is exclusively called on the main thread right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is called in a couple of event handlers that may or may not be on the main thread, so I don't know if I can guarantee it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a race condition with whatever sets this field to null then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its why there is still the try/catch, yes. I admittedly don't have a good sense of the full lifetime of the application, especially as things are closing. I believe in the error that @zadjii-msft found it would be on the main thread (assuming the ui thread is also the one that handles destruction) since it went through GetWindowLayoutsAsync
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more clear, I believe, but have not verified that this is the order of operations in the crash
- Event handler gets called which gives the peasant an
IAsyncOperation
- The coroutine progresses until it yields on
await resume_foreground ...
- The application starts getting closed because we called
PostQuitMessage
- The event loop in
wWinMain
exits so the destructor forAppHost
is called - we set
_window = nullptr
and then callApp->Close
. App->Close()
tries to clear the event handlers / flush any outstanding requests which includes our event handler in (2)- The event handler resumes and then tries to access
_window
which is now a a nullptr. - crash
I have attempted to solve the problem in the following ways
- Remove the event handler for
WindowLayoutRequested
before we callPostQuitMessage
so hopefully there are no outstanding events when we call our destructor - If we still get into the event handler somehow, we check explicitly if the window is a nullptr so that we don't attempt to use it if it is invalid
- If we fail later (say on usage of
_logic
) then we just wrap everything in a try catch and just ignore the error. I have not seen a case where this fails, but it might just be that any crash here would've crashed on the_window
usage before.
I think that this is sufficiently resilient to avoiding crashes but it is possible that there is something I'm missing still in my understanding of how the runtime manages the lifetime of IAsyncOperations
.
// We don't want to try to save layouts if we are about to close | ||
_getWindowLayoutThrottler.reset(); | ||
_windowManager.GetWindowLayoutRequested(_GetWindowLayoutRequestedToken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use a weak_ptr
for the GetWindowLayoutRequested
even handler instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is that the event handler can get called during the destructor for the AppHost
so technically this
is still valid, but half of the state is gone. Removing the event handler makes it so that it (hopefully) can't get called at all. I don't know if revoking the event handler finishes executing outstanding asynchronous events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of a weak_ptr
would prevent this. The moment the destructor starts running, a weak_ptr
cannot be promoted to a strong com_ptr
anymore. So basically if you write code like this:
std::weak_ptr<T> weakSelf{ shared_from_this() };
...
if (const auto self = weakSelf.lock()) {
... // <-- This path will not be reached if the destructor is running.
}
Except of course you meant the issue were our WinUI/XAML classes are being Close()
d long before they're destroyed... But AppHost
is unrelated to WinUI, so this shouldn't apply right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 I'm not sure what to do. Currently AppHost
is not std::enabled_shared_from_this
, and I don't know if that is a desirable change to make. I'm happy to do whatever, but I'm not confident in my ability to appropriately test this since I haven't been able to reproduce the original exceptions myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I'm sorry I mixed up std
and winrt
lingo...
With WinRT we have to use the member function get_weak()
, which returns a winrt::weak_ref
and to promote it to a strong reference you have to use get()
instead of lock()
.
We can wait for input from other terminal devs. 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AppHost
also does not seem to be a winrt type either so it does not have get_weak
Tested this on my machine. The crash still occurs :(. Any way I can help? |
Is there a particular stack trace that you are seeing? The only one I am even remotely confident in fixing is this #11354 (comment) I doubt this fixes any of the things to do with the settings page. |
Carlos's stack looked like the same as #11354 (comment) unfortunately 😕 |
The only thing that stood out to me was As much as I'd like to help. and am motivated to not release bugs in my code (if it is), I don't think I have anything to offer. |
…this way any concurrent event handlers will finish, and future ones will not include us.
// Remove ourself from the list of peasants so that we aren't included in | ||
// any future requests. This will also mean we block until any existing | ||
// event handler finishes. | ||
_windowManager.SignalClose(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this which makes many of the other changes potentially redundant (but it doesn't hurt to have more checks). Assuming this call doesn't fail somehow, we will try to acquire the lock in the monarch which means we block until any outstanding event handlers finish, and then afterwards we are removed from the list of peasants so we shouldn't be called again.
Edit: I remembered that I made this function do nothing if we are in the middle of a quit
operation, so it isn't quite as thorough. But also before a quit happens we stop the automatic saving so it is probably fine anyways. 🤷 Enforcing all of the invariants I want is a pain.
The bug this is linked to is Sev-Blocking -- @zadjii-msft, do we feel comfortable trying to get this fix -- at least, something like it which makes the code more robust even if it doesn't pinpoint the crash -- into 1.12? |
Is there anything I need to do here? |
Sorry I've been off in a hundred different places for the last twoish weeks. Lemme try and take a closer look at this. We've got another day to iterate before the release so I think we can land this, even if we don't have the definite root cause of the crashes found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes in here all feel safe enough. If I was doing it I probably would have just done the if (!_window){}
edit to be more surgical, but since we don't actually fully understand the root cause, I'm not sure we'd get anything from being surgical. The rest of the changes here seem to me to be good defensive things.
We could do the shared_ptr
and weak_ptr
thing. That does seem a tad bit like overkill, since there's only ever one AppHost
(so there was never a need for a pointer to it or any weak references), but this does seem like a valid use case for it. (I similarly didn't bother with making AppHost a winrt type because again, it doesn't really need to be lifetime managed or projected in any meaningful way.)
I might sprinkle additional comments with references to GH#11354 throughout, but 🤷.
I'm cool with doing this as a "hopefully this will make the issue better" for 1.12, and keep our eye out. We'll know soon enough if it didn't work 😄
To be clear, I'd feel much better if you tested the code out / made modifications that you wanted. I never like making changes that I can't really test or reproduce the original issue. |
FWIW, I'd love if I had a reproducible version of this to try with. Unfortunately it seems every time I go looking, it goes away 😢 I did build this locally and couldn't seem to hit it while I was testing it out, so that seems promising at least? |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Hello @DHowett! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
Force merging despite the x64 build failure because the test failures aren't at all related to this PR |
Don't crash if we try to save the window layout while we are closing, and try to avoid saving at all. Might impact #11354 ## Detailed Description of the Pull Request / Additional comments - Revoke the event handler/save throttler so we don't even try to get the window layout when we are closing - Try to check for nullptrs, but then apply `try {} CATCH_LOG()` liberally ## Validation Steps Performed The happy path of saving normally is still fine, but I haven't been unlucky enough to trigger the crash myself.
@@ -401,23 +401,42 @@ void AppHost::LastTabClosed(const winrt::Windows::Foundation::IInspectable& /*se | |||
_HideNotificationIconRequested(); | |||
} | |||
|
|||
// We don't want to try to save layouts if we are about to close | |||
_getWindowLayoutThrottler.reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh jeez, this introduced a new one. I'll file it
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request Clean up an invalid access that I introduced in #11440 <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [x] Closes #11684 * [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [ ] Tests added/passed * [ ] Documentation updated. If checked, please file a pull request on [our docs repo](https://github.com/MicrosoftDocs/terminal) and link it here: #xxx * [ ] Schema updated. * [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? --> ## Summary of the Pull Request Clean up an invalid access that I introduced in #11440 <!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> ## References <!-- Please review the items on the PR checklist before submitting--> ## PR Checklist * [x] Closes #11684 * [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA * [ ] Tests added/passed * [ ] Documentation updated. If checked, please file a pull request on [our docs repo](https://github.com/MicrosoftDocs/terminal) and link it here: #xxx * [ ] Schema updated. * [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx <!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here --> ## Detailed Description of the Pull Request / Additional comments <!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well --> ## Validation Steps Performed (cherry picked from commit 29e6235)
Don't crash if we try to save the window layout while we are closing, and try to avoid saving at all.
Might impact #11354
Detailed Description of the Pull Request / Additional comments
try {} CATCH_LOG()
liberallyValidation Steps Performed
The happy path of saving normally is still fine, but I haven't been unlucky enough to trigger the crash myself.