Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

layer_shell: keep output non-NULL wherever possible #6820

Merged
merged 1 commit into from
Apr 13, 2022

Conversation

tchebb
Copy link
Contributor

@tchebb tchebb commented Feb 6, 2022

NOTE: Please review this PR carefully, as I am not confident enough in my understanding of the lifecycle of the objects in question to be sure that I'm not introducing some other bug (crash, memory leak, etc) with this change.

Our layer shell implementation assigns every layer surface to an output on creation. It tracks this output using the output field on the underlying wlr_layer_surface_v1 structure. As such, much of the existing code assumes that output is always non-NULL and omits NULL checks accordingly.

However, there are currently two cases where we destroy a sway_layer_surface and output is NULL. The first is when we can't find an output to assign the surface to and destroy it immediately after creation. The second is when we destroy a surface in response to its output getting destroyed, as we set output to NULL in handle_output_destroy() before we call wlr_layer_surface_v1_destroy(), which is what calls the appropriate unmap and destroy callbacks.

The former case doesn't cause any problems, since we haven't even allocated a sway_layer_surface at that point or registered any callbacks. The latter case, however, currently triggers a crash (#6120) if a popup is visible, since our popup_handle_unmap() implementation can't handle a NULL output.

To fix this issue, keep output set until right before we free the sway_layer_surface. All we need to do is remove some of the cleanup logic from handle_output_destroy(), since as of commit c9060bc ("layer-shell: replace close() with destroy()") that same logic is guaranteed to be happen later when wlroots calls handle_destroy() as part of wlr_layer_surface_v1_destroy().

This lets us remove some NULL checks from other unmap/destroy callbacks, which is nice. We also don't need to check that the wlr_output points to a valid sway_output anymore, since we unset that pointer after disabling the output as of commit a0bbe67 ("Address emersions comments on output re-enabling") Just to be safe, I've added assertions that the wlr_output is non-NULL wherever we use it.

Fixes #6120.

Our layer shell implementation assigns every layer surface to an output
on creation. It tracks this output using the output field on the
underlying wlr_layer_surface_v1 structure. As such, much of the existing
code assumes that output is always non-NULL and omits NULL checks
accordingly.

However, there are currently two cases where we destroy a
sway_layer_surface and output is NULL. The first is when we can't find
an output to assign the surface to and destroy it immediately after
creation. The second is when we destroy a surface in response to its
output getting destroyed, as we set output to NULL in
handle_output_destroy() before we call wlr_layer_surface_v1_destroy(),
which is what calls the appropriate unmap and destroy callbacks.

The former case doesn't cause any problems, since we haven't even
allocated a sway_layer_surface at that point or registered any
callbacks. The latter case, however, currently triggers a crash (swaywm#6120)
if a popup is visible, since our popup_handle_unmap() implementation
can't handle a NULL output.

To fix this issue, keep output set until right before we free the
sway_layer_surface. All we need to do is remove some of the cleanup
logic from handle_output_destroy(), since as of commit c9060bc
("layer-shell: replace close() with destroy()") that same logic is
guaranteed to be happen later when wlroots calls handle_destroy() as
part of wlr_layer_surface_v1_destroy().

This lets us remove some NULL checks from other unmap/destroy callbacks,
which is nice. We also don't need to check that the wlr_output points to
a valid sway_output anymore, since we unset that pointer after disabling
the output as of commit a0bbe67 ("Address emersions comments on
output re-enabling") Just to be safe, I've added assertions that the
wlr_output is non-NULL wherever we use it.

Fixes swaywm#6120.
@tchebb
Copy link
Contributor Author

tchebb commented Feb 6, 2022

Updated to also remove NULL checks in handle_destroy(), since I don't think those are needed either after this change.

@Nefsen402
Copy link
Member

FYI I am currently running the effort of doing the migration to wlr_scene #6562. My branch rewrites this code in question. I haven't look at the code too closely but it seems to me like this patch fixes a NULL deference but introduces a use-after-free/assert. If an output got disabled while a popup was active, the sway_output is now freed, and since we're no longer nulling out the sway_output on line 286 things will start trying to read/write from garbage memory potentially if a transaction happens in the meantime. On the flip side, if a popup happens while all outputs are disabled, then handle_layer_shell_surface won't assign a output and NULL will start pinging off the walls anyway.

@tchebb
Copy link
Contributor Author

tchebb commented Feb 12, 2022

in the meantime

Can you elaborate on exactly which "meantime" you mean here? It's quite possible that this code introduces a potential use-after-free; like I noted, I'm not totally confident in my understanding of the lifecycle here.

My current understanding is that this change is safe because we still NULL out the pointer and remove the appropriate list items later on, in handle_destroy() (see line 373). I think handle_destroy() runs synchronously shortly after handle_output_destroy() because of the latter's call to wlr_layer_surface_v1_destroy(), which signals our sway_layer's destroy callback. Please correct me if I'm wrong, though.

@tchebb
Copy link
Contributor Author

tchebb commented Feb 12, 2022

if a popup happens while all outputs are disabled, then handle_layer_shell_surface won't assign a output and NULL will start pinging off the walls anyway

I did consider this case as well, and my commit message addresses it:

The former case doesn't cause any problems, since we haven't even allocated a sway_layer_surface at that point or registered any callbacks.

Edit: actually, what I considered was when an entirely new layer appears while all outputs are disabled, not when a popup of an existing layer appears. But I don't think the latter case can ever happen, since we destroy a layer as soon as its output goes away. Is there any time in which it would have a chance to make a popup after the output goes away but before it's destroyed?

@Nefsen402
Copy link
Member

Can you elaborate on exactly which "meantime" you mean here? It's quite possible that this code introduces a potential use-after-free; like I noted, I'm not totally confident in my understanding of the lifecycle here.

Thinking about this more, I don't think this is possible because wl_signals get called synchronously. This is what I was thinking:
Transactions in sway get committed when all the surfaces ack their configure events. When an output disappears, then sway will rearrange all the windows sending configure events. When all the surfaces acked the reconfigures, then that's when the old sway_output will get freed (along with friends). It might be possible for the sway_output to get freed before the layer_shell gets notified of the output destroy - but isn't actually.

actually, what I considered was when an entirely new layer appears while all outputs are disabled, not when a popup of an existing layer appears. But I don't think the latter case can ever happen, since we destroy a layer as soon as its output goes away. Is there any time in which it would have a chance to make a popup after the output goes away but before it's destroyed?

Right, nevermind. This case isn't possible.

This LGTM.

@Nefsen402 Nefsen402 mentioned this pull request Feb 21, 2022
5 tasks
@tchebb
Copy link
Contributor Author

tchebb commented Apr 12, 2022

Since #6844 is still under development, can we merge this in the meantime? Also nice to have a more correct version of the old code in the Git history, even once the rewrite replaces it. cc @emersion

Copy link
Member

@emersion emersion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this LGTM!

@emersion emersion merged commit d726e50 into swaywm:master Apr 13, 2022
@MayeulC
Copy link

MayeulC commented Jun 8, 2022

Thanks a lot for this fix!

I'm wondering when it will land in a release? I just hit that issue (#6855 to be precise, I disconnected a laptop from the docking station without opening it), and it seems to be hit quite frequently, going by the comments on various issues.

What are the release plans for the next version? If the next release is far away, would you mind doing a point release with just this change (and possibly other fixes)?

Thank you again for your work!

@tchebb tchebb deleted the fix-6120 branch April 18, 2023 00:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Segfault in layer_shell.c : popup_damage when laptop output is disabled
4 participants