Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

On Android pthread_attr_getstack() includes the guard pages in the returned stack size #1535

Open
thestinger opened this issue Oct 12, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@thestinger
Copy link

Bug Description

64-bit ARM requires 64k as the minimum guard size since it uses this as the default stack probe size to reduce the overhead of -fstack-clash-protection compared to a 4k size. GrapheneOS uses 64k as the default guard size and adjusts a guard size below 64k to 64k similarly to glibc. Hermes handles this incorrectly and crashes with a stack overflow error. This is causing apps which update to the new version to crash on GrapheneOS, reducing app compatibility. This is a severe issue and you're requiring that operating systems have a security vulnerability in order to be compatible with Hermes.

We plan on reporting the lack of the required minimum guard size as a security vulnerability in Android, therefore it should get fixed in a monthly security update. Apps using React Native / Hermes would then break with the stock OS across devices if they update to this version. We'll be reported it soon and it usually takes them a minimum of 3 months to ship a patch, although probably longer for something like this. The issue we see is you don't control when apps update and they're currently slowly updating or starting out using a broken version which they may not upgrade from for years.

Steps To Reproduce

Run apps using Hermes on one of the supported GrapheneOS devices and notice the crash from the stack overflow detection having a false positive.

https://grapheneos.org/faq#recommended-devices
https://grapheneos.org/install/web

The Expected Behavior

Should take the guard size into account properly and avoid crashing. The actual guard size that's used needs to be required rather than assuming it isn't rounded up higher than 4k.

@thestinger
Copy link
Author

We're working around this by going back to a 4k stack guard temporarily. We can add detection of the Hermes library and downgrade security for apps using it in the future. It's very problematic though since the ABI says the stack probe size is 64k so even if we use 4k that doesn't mean everything will use 4k.

@tmikov tmikov added need more info Awating additional info before proceeding and removed bug Something isn't working labels Oct 12, 2024
@tmikov
Copy link
Contributor

tmikov commented Oct 12, 2024

I am sorry, we will be happy to fix this, but can you explain in more detail what the actual problem in Hermes is? When is it crashing and how? What is the error message? Do you have a symbolicated stack trace? Do you have a minimal repro?

@thestinger
Copy link
Author

This is the stack overflow detection code that's buggy:

std::pair<const void *, size_t> thread_stack_bounds(unsigned gap) {

bool isStackOverflowingSlowPath() {

It doesn't take into account the guard size in the calculations, which isn't always 4k. It works with a 4k or 16k guard but breaks on the way to 64k. 64-bit ARM on Linux has standardized a 64k guard since it's better for security but Android hasn't moved to it from 4k. GrapheneOS uses a 64k guard, but we're working on a new release which among other changes temporarily rolls back to 4k due to this issue in Hermes. If you test on a non-end-of-life Pixel with the current release of GrapheneOS you can replicate the crash. You could also manually set a 64k guard size in your code and see that it results in a crash if you want an easy way to do it without GrapheneOS. Hermes is currently using the default guard size, which is 64k rather than 4k on GrapheneOS. However, setting it manually to 4k would also get 64k since it's a minimum value just like glibc for arm64.

@thestinger
Copy link
Author

The calculation wrongly thinks the stack has overflowed and Hermes aborts, wrongly thinking there's an overflow. Do not know the exact details of how it goes wrong, just that it breaks with the default guard size set to 64k instead of 4k which is definitely a bug. It's possible it just doesn't allocate large enough stacks for a 64k guard and needs to factor it in. Not sure exactly what the code is doing, not at all familiar with it.

@tmikov
Copy link
Contributor

tmikov commented Oct 12, 2024

I am sorry, you will have to work closely with us here, since we don't have Graphene installed. Let's try to figure out what is going on.

I am looking at the code and I don't see how the guard size is relevant at all for this check. The code is pretty simple. It obtains the stack bounds, subtracts a configurable "gap" (64KB by default) for additional safety, and checks whether the stack pointer falls within that region.

So, for the check to be failing, the stack bounds must be incorrect. How is that possible? And why does the size of the guard matter at all?

@thestinger
Copy link
Author

You can explicitly set the guard size to 64k when creating the threads and it should fail in the same way.

@tmikov
Copy link
Contributor

tmikov commented Oct 12, 2024

I just tried that on arm64 Linux: https://gist.github.com/tmikov/5cc745cf16cc35118fb459b0d4414f85. It worked as expected.

We really can't do anything more about this until we can understand what the bug is.

@thestinger
Copy link
Author

It needs to be done on Android. Bionic handles things differently than glibc.

@tmikov
Copy link
Contributor

tmikov commented Oct 13, 2024

@thestinger it appears that Bionic has a bug and it includes the guard size in the reported stack size. The problem is not in the size being 64KB specifically, but that we assume that the entire reported stack size is valid. Also, the problem appears only when there is an actual stack overflow.

We will work to fix this.

@tmikov tmikov changed the title bug in stack overflow check breaks Android apps operating systems following the standard 64-bit ARM ABI for minimum guard size including GrapheneOS, will impact everything based on AOSP in the future too On Android pthread_attr_getstack() includes the guard pages in the returned stack size Oct 13, 2024
@tmikov
Copy link
Contributor

tmikov commented Oct 13, 2024

To be clear, the problem I discovered has different symptoms than what you are describing. The problem only appears when there is an actual stack overflow. Our stack checking fails to catch it, resulting in a sigsegv. Sure that is bad, but the app was already crashing.

@thestinger
Copy link
Author

That's different from what I'm describing since Bionic adds space for the stack guard so there should be just as much space available with a larger stack guard and it shouldn't cause an overflow, Also, the stack checking is what's causing the abort we see rather than a stack overflow.

@thestinger
Copy link
Author

@tmikov That's not what I was reporting here. I suggest not working around how Android reports the guard page as part of pthread_attr_getstack because that can be changed. It seems like the opposite of what I'm reporting which is the stack check triggering prematurely, not too late.

@tmikov
Copy link
Contributor

tmikov commented Oct 13, 2024

@thestinger I wrote a small test app, creating a thread with 64KB stack guard size. I tested it on MacOS, Linux and Android, all arm64. It worked normally on all platforms and did not trigger a stack overflow prematurely. I do not see how any guard size can trigger a premature stack overflow. The stack overflow check is simply a comparison of the current stack pointer with the bounds of the stack. The guard size is irrelevant in that calculation.

I especially don't see how it can cause a premature SIGSEGV.

I did however discover that, only on Android, stack overflow is not correctly detected with 64KB guard size, causing a crash instead of a controlled JS exception. This is explained fully by the difference of behavior of pthread_attr_getstack(), most likely a Bionic bug.

At the moment I am satisfied that I understand what I am seeing and I have a workaround for it. I am grateful that we caught this problem. Again, it only affects code that is experiencing an actual stack overflow, causing a SIGSEGV instead of gracefully throwing a JS exception. This is not great, but that code was already broken.

If you are seeing different kinds of crashes, there must be a different explanation. Happy to work with you on it, but we need more information on order to diagnose it. Unfortunately for now we can't install GrapheneOS on our devices, so we will have to rely on you to run some experiments.

If you can run https://gist.github.com/tmikov/fc3e91a16b9debb8877e4257a551fd08 in GrapheneOS, it will be interesting to observe the behavior in these two cases:

  • stack v guard : run native recursion with 64KB guard. checking for stack overflow
  • stack m guard: create a thread with 64KB guard and manually probe the stack

@thestinger
Copy link
Author

I did however discover that, only on Android, stack overflow is not correctly detected with 64KB guard size, causing a crash instead of a controlled JS exception. This is explained fully by the difference of behavior of pthread_attr_getstack(), most likely a Bionic bug.

We know exactly why this happens but changing the behavior breaks compatibility with apps. In general, behavior like this could only be changed for a new target API level. Android itself has code in ART depending on it working the way that it does. It's because they set both the internal mmap base and stack base to the mapping base instead of offsetting the stack base. It can't be easily changed due to compatibility needing to be preserved. Could try to convince them to change it for a new target API level but they have bigger things to worry about. It shouldn't break many apps and the apps it breaks should be able to deal with it since they're messing with low level, subtle details. We cannot change this behavior though.

You could change your code to work around this Bionic behavior but then your code will break if it's changed in Bionic. I would expect that to only happen with a new target API level but they could decide to do it with a new major yearly release for all apps instead. I'd expect too much breaks to do it without the new target API level.

At the moment I am satisfied that I understand what I am seeing and I have a workaround for it. I am grateful that we caught this problem. Again, it only affects code that is experiencing an actual stack overflow, causing a SIGSEGV instead of gracefully throwing a JS exception. This is not great, but that code was already broken.

It is quite possible the app we encountered depends on catching a stack overflow. Only 1 app has been reported as broken so far and it's likely much more would have broken if it caused problems outside of extreme edge cases. React Native is extremely popular but 1 user has reported 1 app stopped working after the app was updated, and it's an obscure app.

@thestinger
Copy link
Author

@tmikov It does appear to be the case that this app triggers a stack overflow and normally recovers from it, but with the guard size raised to 64k it can't recover because the stack overflow check is misaligned with what Bionic actually does and allows it to hit the guard.

@thestinger
Copy link
Author

It looks a lot like the developers of this app worked around a bug by catching an exception and simply not fixing it. That's why it's the only app which broke. Some of the stuff we find with our security hardening is truly strange. Still not as weird as how tons of banking apps use the 32-bit ARM syscall ABI perhaps by accident on pure 64-bit devices as part of their anti-tampering frameworks.

@thestinger
Copy link
Author

Android's implementation does not appear to violate the POSIX requirements. It requires that the address range is readable and writable for setstack, not getstack. A way to work around this without hard-wiring a specific Bionic behavior would be to configure the thread with a custom mapping with a 64k guard on both ends for arm64 and 4k for 32-bit arm so that you control the layout and can make assumptions about it. I don't think there's any hope of convincing Google to change Bionic since what they're doing is not explicitly forbidden.

tmikov pushed a commit to tmikov/hermes that referenced this issue Oct 14, 2024
Summary:
Related to: facebook#1535

It turns out that on Android/Bionic `pthread_attr_getstack()` includes
the stack guard pages in the returned size. Whether that is a bug or not
is debatable, but it differs in behavior from Linux/glibc. Plus,
returning a stack range that is not actually not all accessible is less
than useful.

In practice it worked with our approach because the default stack guard
size is 4KB on Android, while our "native gap" is 64KB. Our gap is
larger than the stack guard size, so we would still correctly detect and
catch stack overflow.

However, if the stack guard size is 64KB (or larger), we would not
detect the overflow and crash. It is possible to manually set a larger
guard size, plus security-oriented OS-es like GrapheneOS do it by
default.

The fix is to subtract the guard size from the available stack size when
running on Bionic.

Added tests checking the behavior in the main thread, in a default
thread and in a thread with a 64KB guard. Two separate tests are
performed under these circumstances:
- Unbounded recursion checking for native stack overflow
- Manually reading the available stack range to ensure it is all
accessible.

Differential Revision: D64308925
@tmikov
Copy link
Contributor

tmikov commented Oct 14, 2024

Fix in #1538

facebook-github-bot pushed a commit that referenced this issue Oct 17, 2024
Summary:
Pull Request resolved: #1538

Related to: #1535

It turns out that on Android/Bionic `pthread_attr_getstack()` includes
the stack guard pages in the returned size. Whether that is a bug or not
is debatable, but it differs in behavior from Linux/glibc. Plus,
returning a stack range that is not actually not all accessible is less
than useful.

In practice it worked with our approach because the default stack guard
size is 4KB on Android, while our "native gap" is 64KB. Our gap is
larger than the stack guard size, so we would still correctly detect and
catch stack overflow.

However, if the stack guard size is 64KB (or larger), we would not
detect the overflow and crash. It is possible to manually set a larger
guard size, plus security-oriented OS-es like GrapheneOS do it by
default.

The fix is to subtract the guard size from the available stack size when
running on Bionic.

Added tests checking the behavior in the main thread, in a default
thread and in a thread with a 64KB guard. Two separate tests are
performed under these circumstances:
- Unbounded recursion checking for native stack overflow
- Manually reading the available stack range to ensure it is all
accessible.

Reviewed By: avp

Differential Revision: D64308925

fbshipit-source-id: cdc5e8df6dc0aefd2735d67384d9a7728151d75c
@tmikov tmikov added bug Something isn't working and removed need more info Awating additional info before proceeding labels Oct 18, 2024
facebook-github-bot pushed a commit that referenced this issue Nov 9, 2024
Summary:
Ported from Hermes D64308925.
Related to: #1535

It turns out that on Android/Bionic `pthread_attr_getstack()` includes
the stack guard pages in the returned size. Whether that is a bug or not
is debatable, but it differs in behavior from Linux/glibc. Plus,
returning a stack range that is not actually not all accessible is less
than useful.

In practice it worked with our approach because the default stack guard
size is 4KB on Android, while our "native gap" is 64KB. Our gap is
larger than the stack guard size, so we would still correctly detect and
catch stack overflow.

However, if the stack guard size is 64KB (or larger), we would not
detect the overflow and crash. It is possible to manually set a larger
guard size, plus security-oriented OS-es like GrapheneOS do it by
default.

The fix is to subtract the guard size from the available stack size when
running on Bionic.

Added tests checking the behavior in the main thread, in a default
thread and in a thread with a 64KB guard. Two separate tests are
performed under these circumstances:
- Unbounded recursion checking for native stack overflow
- Manually reading the available stack range to ensure it is all
accessible.

Reviewed By: avp

Differential Revision: D64309284

fbshipit-source-id: 83b881fbb336cbab5bfde2520eae68d0ca4dacde
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants