Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RWLock fallbacks seem to cause lockup in Firefox #32

Closed
K4sum1 opened this issue Jul 10, 2024 · 15 comments
Closed

RWLock fallbacks seem to cause lockup in Firefox #32

K4sum1 opened this issue Jul 10, 2024 · 15 comments

Comments

@K4sum1
Copy link

K4sum1 commented Jul 10, 2024

I tried to fork Rust9x to completely remove TryAcquireSRWLock and either replace the two functions with non-Try versions or critical sections.

However when I look into the code, it seems it tries to use them first, and if it can't find the functions, it then falls back to critical sections.

Something here is funny though, because in Firefox compiled with rust9x on a system without TryAcquireSRWLock but otherwise has SRW locks, the browser just freezes after a minute. If I hex edit out the Try part, the browser just works fine.

So either the fallback fails and causes the whole thing to freeze, or something fails when it hits critical sections, or something else idk.

If this is only an issue with Firefox, please give me some pointers as to how I could trace the issue, because I literally have no mentions of TryAcquireSRWLock anywhere in the source code of my modified version. This call is only from Rust.

Also idk if intentional, but when replicating this commit 08798a9 I noticed that in c.rs when you add TryEnterCriticalSection, you don't also replicate it in windows_sys.rs. Now I notice the code for the function itself looks the same as any other, so I assumed that if it's there in c.rs, it doesn't need to be in windows_sys.rs. However I also notice you just left all the SRW lock and ConditionVariable functions in windows_sys.rs. I'm not sure if this is a mistake or intentional, but it doesn't look right to me.

@K4sum1
Copy link
Author

K4sum1 commented Jul 10, 2024

I found a single try_lock call in Firefox, and I was able to easily remove it, but the issue still occurs. I have tested with hex editing out Try that it still works and it's not my fix.

@seritools
Copy link
Member

seritools commented Jul 11, 2024

However I also notice you just left all the SRW lock and ConditionVariable functions in windows_sys.rs. I'm not sure if this is a mistake or intentional, but it doesn't look right to me.

I keep the changes to windows_sys.lst/rs as small as possible to make rebasing easier in the future. Unused declarations don't make it into the binary.

So either the fallback fails and causes the whole thing to freeze,

You could try forcing different Mutex kinds here and see if the behavior changes:


I have tested the basic lock detection on Win 98SE (CreateMutex), Win XP SP3 (CriticalSection), Win 11 (SRWLocks) via the sample program insofar that they seem to work fine and lock and unlock properly. There have not been any stress tests or anything, so I can't guarantee that there isn't something wrong.

If this is only an issue with Firefox, please give me some pointers as to how I could trace the issue

No idea if it's just firefox, but you could trace the loading/init process and hook the init code or even GetProcAddress to figure out where it loads the SRWLock stuff. Also could just break whenever it accesses one of the sync functions you're interested in.

Also idk if intentional, but when replicating this commit 08798a9 I noticed that in c.rs when you add TryEnterCriticalSection, you don't also replicate it in windows_sys.rs.

Only things that will be added to the statically dllimported (= land in the import table header of the binary) have to be added to windows_sys.lst/rs. Since TryEnterCriticalSection is optional and not exported on all systems, it is loaded at runtime. If i didn't do that, binaries would fail loading with "Import kernel32!TryEnterCriticalSection not found" or whatever the error message from the windows loader is.

Notice how every function (excluding WinSock) I added to windows_sys.lst is available on every single 32-bit Windows kernel.

@K4sum1
Copy link
Author

K4sum1 commented Jul 12, 2024

I keep the changes to windows_sys.lst/rs as small as possible to make rebasing easier in the future. Unused declarations don't make it into the binary.

Only things that will be added to the statically dllimported (= land in the import table header of the binary) have to be added to windows_sys.lst/rs. Since TryEnterCriticalSection is optional and not exported on all systems, it is loaded at runtime. If i didn't do that, binaries would fail loading with "Import kernel32!TryEnterCriticalSection not found" or whatever the error message from the windows loader is.

Notice how every function (excluding WinSock) I added to windows_sys.lst is available on every single 32-bit Windows kernel.

These seem to conflict, but I guess it works.

You could try forcing different Mutex kinds here and see if the behavior changes:

Next time I build Rust, I'll try that.

No idea if it's just firefox, but you could trace the loading/init process and hook the init code or even GetProcAddress to figure out where it loads the SRWLock stuff. Also could just break whenever it accesses one of the sync functions you're interested in.

Debug builds don't work too well on Vista, but I figure worst case I can just malform the TryAcquireSRWLock functions by hex editing to not match on 7+ and try that way.

@seritools
Copy link
Member

Let me know if you figure out more! 🙏

@K4sum1
Copy link
Author

K4sum1 commented Jul 12, 2024

I force set it to CriticalSection, and I have the issue on both Vista and 10, and hex editing the TryAcquireSRWLock call does not fix it.

I'll try Legacy tomorrow, I am too tired right now.

@seritools seritools changed the title TryAcquireSRWLock fallback might be funny? CriticalSection fallback causes lockup in Firefox Jul 12, 2024
@K4sum1
Copy link
Author

K4sum1 commented Jul 13, 2024

Legacy mutex's seemed to last longer, but I still got the same problem. The browser works from anywhere from a few seconds to a few minutes before freezing.

I did get the funniest error in testing though

[Child 6968, MediaDecoderStateMachine #1] WARNING: 1765e560 Could not set cubeb stream name.: file C:/Users/Admin/Documents/GitHub/r3dfox/dom/media/AudioStream.cpp:321
[Child 6968, MediaDecoderStateMachine #1] WARNING: 1ca72e80 Could not set cubeb stream name.: file C:/Users/Admin/Documents/GitHub/r3dfox/dom/media/AudioStream.cpp:321
CCCCCCrrrrrraaaaaasssssshhhhhh      AAAAAAnnnnnnnnnnnnoooooottttttaaaaaattttttiiiiiioooooonnnnnn      GGGGGGrrrrrraaaaaapppppphhhhhhiiiiiiccccccssssssCCCCCCrrrrrriiiiiittttttiiiiiiccccccaaaaaallllllEEEEEErrrrrrrrrrroorooorrorrr::r:::  :   || |||[[|[[[CC[CCC00C000]]0]]][[][[[GG[GGGFFGFFFXXFXX
X11X111--1---]]-]]]::]:::  :   CC CCCooCooommommmppmpppoopooossossisisiititittototoorororrBrBrBBrBrBrrirriidiiiddgddgdgeggegeCeCeeChhCCCihhihlhiilidildll l dddrdr   e errrcrceeeeeecccicieeevveiieieivvsvsvee e essIsIs  P P IICICIPP  PPCcCcCCl  l  occoccsllslleooeoo ss ssweeweei  i  twwtwwhiih
ii tt ttrhhrhhe e   ararrsseerooeaaennassa==soosAAonnobb=n=nnnA=A=oobAbArrnbnbmmononaaorrollrmmrSSmaamhhallauulSSlttShhSddhuuhoouttuwwtddtnndood  owwo((wnnwttn  n== (( 11(tt(85t==t78=11=912781..585877334758...))6161  .7519[[9))5GG)  )FF [ [XX[G[11GGFG--FFXF]]XX1X::11-  1--]CC-]]:oo]:: mm:  C
pp CCooCooosmmomsipmppiotpootsoossoirsiirtBitBtortororiorirBdrBdBrgBrgrierieidCidCdghdghgeigeieCleClChdChdhi hi ilrilrldelded cd c re rereireiecvecvceeceeeiseisiv iv veIveIesPesPs Cs C I  I IPcIPcPClPCl oCC ocs  csleccleo llo swooswiesseit ee thw  wh iwwi rtiitrehtthea hh asr  rsoerreonaeean
=asa=sAsosAobonobnnn=nn=o=A=oArAbrAbmbnmbnanoanolorlorSrmSrmhmahmuaaluatllStldSShdSohhuohwuutwunttdnt
ddo
odwowonwnw
n
n

@seritools
Copy link
Member

seritools commented Jul 13, 2024

I wonder if Firefox (probably rightly?) assumes that an RWLock can be read-locked multiple times. The fallbacks to a critical section or mutex of course won't allow that, which might cause a deadlock. In that case Rust9x would need to implement a completely separate fallback RWLock implementation, e.g. based on CreateSemaphore (write lock = acquiring tickets/counts) or other primitives supported on older systems (maybe something from here works)

@K4sum1
Copy link
Author

K4sum1 commented Jul 13, 2024

In that case Rust9x would need to implement a completely separate fallback RWLock implementation, e.g. based on CreateSemaphore (write lock = acquiring tickets/counts) or other primitives supported on older systems (maybe something from here works)

You might be right, I was also experimenting with the Rust used to create Mypal68 (Firefox 68 for XP) and I noticed Semaphores are used in it. If you want to see the patch for it, here it is.

Eclipse-Community@a7d5b8c

It is 1.45.2 though, and I think I would need at minimum 1.66.0 for 115, and 1.76.0 for 128. I originally wanted to make a fork of 1.77.2 since that works for both 115 and 128, but I couldn't figure out how to port the changes to 1.77.2, however I was kinda doing it wrong in the first place so I could probably do it if I tried again.

@seritools
Copy link
Member

Some of the implementations in that commit look super interesting!

The RWLock impl there however doesn't seem to use semaphores or allow multiple read locks at the same time either, it also just falls back to std's old ReentrantMutex implementation. That implementation doesn't exist anymore in current Rust std (PR). That old reentrant mutex impl used critical sections, but the wrapper in SRWLock around it prevents reentrant locking too.

@K4sum1
Copy link
Author

K4sum1 commented Jul 13, 2024

I also suspect that the freeze is GFX related, and I know modern Firefox uses WebRender, which Mypal68 doesn't have. I would love to try building an older Firefox version that still allowed for disabling WebRender, however finding all the dependencies for that sounds like hell.

@seritools
Copy link
Member

Hmmm, I'm pretty sure my current RwLock fallback implementation is just wrong (and the one in your linked commit too!). This example should panic immmediately, as the fallbacks guard against reentrancy. I've created #33 for it.

Not sure when I get to updating rust9x though at the moment.

@K4sum1
Copy link
Author

K4sum1 commented Jul 13, 2024

I guess I can look at the rwlock implementations in Firefox and see if anything would cause a crash or have this issue.

@seritools seritools changed the title CriticalSection fallback causes lockup in Firefox RWLock fallbacks seem to cause lockup in Firefox Jul 15, 2024
@K4sum1
Copy link
Author

K4sum1 commented Jul 16, 2024

I don't know what I am looking for.

I've traced it to this file as it's the only WebRender file to use std::sync::RwLock, but I'm not sure what in it would be.

https://github.com/mozilla/gecko-dev/blob/release/gfx/wr/wr_glyph_rasterizer/src/rasterizer.rs#L28

@seritools
Copy link
Member

I mean the RWLock fallback implementations are just incorrect.

You could try replacing the Arc<RwLock<..>> with an Arc<Mutex<...>> and test again. If the broken RWLock fallback is the cause, the Arc Mutex variant should also lock up on modern systems.

@K4sum1
Copy link
Author

K4sum1 commented Jul 16, 2024

I mean the RWLock fallback implementations are just incorrect.

Well, yeah, but it still partially works. So I would want to mitigate the issue for now.

@rust9x rust9x locked and limited conversation to collaborators Jul 16, 2024
@seritools seritools converted this issue into discussion #35 Jul 16, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants