Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM EABI: Exception handling broken #489

Closed
dnadlinger opened this issue Oct 6, 2013 · 45 comments · Fixed by ldc-developers/druntime#51
Closed

ARM EABI: Exception handling broken #489

dnadlinger opened this issue Oct 6, 2013 · 45 comments · Fixed by ldc-developers/druntime#51
Labels

Comments

@dnadlinger
Copy link
Member

LDC now compiles a Phobos "Hello World", but any attempts of throwing an exception lead to _Unwind_RaiseException failing with error code 9 (generic error). We need to fix this ASAP as it also makes running the test suite really hard.

@dnadlinger
Copy link
Member Author

Turns out the libunwind prototypes are different on ARM: https://gist.github.com/klickverbot/7124007 (that patch is literally a snapshot from during a hacking session, we might want to properly separate out the implementations in the end). Still crashes, though, this time in _Unwind_GetLanguageSpecificData. The next step would be to dump the offending EH tables and figure out what is going wrong, I suppose.

@dnadlinger
Copy link
Member Author

Oh, and I'd highly recommend everybody intending to work on this to get a debug version of the GCC libraries (package might be called something along the lines of gcc-libs), being able to step into the libunwind calls saved my sanity a few times.

@dnadlinger
Copy link
Member Author

After having had a look at the G++ personality function, the code now (updated Gist) gets to the unwinding phase. The exception struct is propagated along correctly, but the catch classinfo lookup is still broken.

@dnadlinger
Copy link
Member Author

Updated Gist again – now with working catch classinfo lookup. Simple EH examples work now, although the test suite reveals that there is still at least one big problem with the implementation that needs to be tracked down.

@redstar
Copy link
Member

redstar commented Jan 7, 2014

Floating point handling is currently broken. There is an assertion error in PortInitializer::PortInitializer while initializing Port::ldbl_nan. ARM uses only double and the provided pattern is therefore wrong. This causes a lot of failures.
I also get a linker error if I use -g, which causes all tests with debug information to fail.
(I have my new ARM board now ready and do a nightly LLVM and LDC build :-) )

@dnadlinger
Copy link
Member Author

Yep, seems reasonable – I just used a release build for testing, as I was only hacking on the runtime anyway.

Can you acutually build the Phobos test in a reasonable amount of time? std.algorithm and std.datetime cause my ARM system to swap, even if the system itself, apart from LDC, only takes ~50 MB of RAM. -.-

@redstar
Copy link
Member

redstar commented Jan 7, 2014

I can time it. I run a script executed by cron and simply mail the output. But first the floating point failure must be fixed - only 17% of the tests passes right now (merge-2.064 branch).

@redstar
Copy link
Member

redstar commented Jan 9, 2014

The test suite runs roughly 16h. (Compile time for LDC is much less then an hour.) Only 68% pass.

@dnadlinger
Copy link
Member Author

Eww, that's awfully long. Is your system swapping like crazy or something like that? From my vague recollection of the specs of your system and the performance on my ODROID-X, I would have maybe guessed a few hours or so, but definitely not 16!

@redstar
Copy link
Member

redstar commented Jan 9, 2014

I have to check. Previously, only 17% of the tests passed. But runtime of the test suite was only 20min.

The system has 2GB memory and no swap space.

@redstar
Copy link
Member

redstar commented Jan 20, 2014

I applied the gist patch yesterday. The tests seem to look better. I have the result hopefully this evening. (I currently think that the long runtime is caused by power management. I have to investigate this closer...)

I also identified the problem why applications with debug info do not link. (http://llvm.org/bugs/show_bug.cgi?id=18554) I already committed the first part of fixing this.

@dnadlinger
Copy link
Member Author

@redstar: Yes, without the patch pretty much every testcase containing EH at all (also e.g. assertThrown) fails, whereas the patched personality function at least lets us get the simpler cases right. The remaining problems are still responsible for a fair share of the test suite failures, but I don't really have time to look into that right now.

@andralex
Copy link
Contributor

redstar added a commit that referenced this issue Jun 27, 2014
Includes some of the EH changes for ARM from issue #489.
Adds thread support for MIPS64.
redstar pushed a commit that referenced this issue Sep 27, 2014
Correct use of Linux version identifier in epoll interface
redstar added a commit that referenced this issue Sep 27, 2014
This is a copy of the easy part of issue #489.
redstar added a commit to redstar/ldc that referenced this issue Nov 5, 2014
kinke pushed a commit to kinke/ldc that referenced this issue Nov 9, 2014
@joakim-noah
Copy link
Contributor

Could the remaining EH test failures be related to function inlining? I turned off function inlining on Android/ARM and got a bunch more EH-related tests to pass. Could one of you verify this on linux/ARM?

@joakim-noah
Copy link
Contributor

I looked into this recently, especially the ARM load/store optimization pass I mentioned in the above linked thread. That backend pass would often cause a segfault in eh_personality_common right before get_uleb128 is called, because of the assignments of block_size and landing_pad right before.

Since callsite_walker is a byte array, it returns addresses that are not necessarily word-aligned. That works fine when that ARM optimization pass isn't run and they're converted to LDR instructions, but the optimization pass tries to turn the two assignments into a single LDMIB, which will fail if the callsite_walker address happens not to be aligned to a 4-byte word. I'm guessing this information about word alignment is lost to the backend because of the *cast(uint*) employed for all the assignments.

Turning off that ARM optimization pass fixed the EH problem up till ldc commit e9e8e4654d5 from a couple months ago, but it appears the EH rewrites since then have broken something else on ARM also, which I haven't tracked down. As noted in the above thread, inlining is not really the issue: turning on the debug statements and turning off inlining merely tickled the IR enough that this optimization wasn't being done, ie it was a hack that happened to work.

@joakim-noah
Copy link
Contributor

I looked into what else was causing problems with EH on ARM since ldc commit e9e8e4654d5, turns out it was this check skipCatchComparison() before comparing exception types, that was since added. If I comment out that check and compile libunwind.d with the above ARM optimization pass turned off, ARM EH starts working for me on ldc master. The problem appears to be that HANDLER_FRAME is never set in the ARM _d_eh_personality function because ucb.barrier_cache.sp always returns zero. Not sure why as I didn't look further, but I did notice that some of the types are off, ie _uw is aliased to ptrdiff_t but is uint in the C headers, could be something like that.

The new test for exception chaining across fiber switches asserts, but other than that everything works on Android/ARM. Let me know if it's the same on linux/ARM.

@joakim-noah
Copy link
Contributor

I'm still seeing the other issue with the check before comparing exception types mentioned in my last comment, @smolt, that one not hitting you?

@smolt
Copy link
Member

smolt commented Jan 7, 2016

No, that one is not hitting, but watchOS does not use ARM_EABI_UNWINDER path either. And I am not familiar with that path. I have a Raspberry Pi laying around that I should try it on.

I do have one EH catch problem remaining that I wrote up in the "Watch Out" forum announcement: the stack pointer not corrected in landing pads after a function call with stack adjustments. It appears to be an LLVM problem and I have my workaround in forked LLVM to restore stack from frame pointer. I was going to eventually write that up as a separate issue once I was sure it wasn't watchOS specific as there is no watchOS code in LDC yet. It ain't a real issue if no one can exercise it.

@dnadlinger dnadlinger reopened this Jan 7, 2016
@dnadlinger
Copy link
Member Author

I didn't mean to close this issue, apparently the words "resolve: ldc-developers/druntime#51" in the PR text were enough for GitHub to auto-close this.

I originally opened this issue working on arm-linux-gnueabihf, and I don't think anybody has fixed (or tested) that lately.

@dnadlinger dnadlinger changed the title ARM: Exception handling broken ARM EABI: Exception handling broken Jan 7, 2016
@smolt
Copy link
Member

smolt commented Jan 7, 2016

I think the original Pi I have is arm-linux-gnueabihf. I'll haven't played with it in a year but might be fun to cross-compile to it. I don't think it has the horse power to build everything. Last time I tried building a recent version of GDC on an older 800Mhz computer it took over 24 hours.

@dnadlinger
Copy link
Member Author

I've built LDC on an ODROID-X (4x1.4 GHz Cortex-A9) in a couple of minutes. The biggest issue tends to be the slow I/O if you are using SD cards. For an RPi 1, I'd probably use a cross-compiler for my own sanity though.

@smolt
Copy link
Member

smolt commented Jan 31, 2016

Glad to say I have Pi with LDC master going.

I'm still seeing the other issue with the check before comparing exception types mentioned in my last comment, @smolt, that one not hitting you?

Yes - Pi has this issue too - as well as a bunch of others! As you reported, just commenting out the line:

        if (!nativeContext.skipCatchComparison())

seems to fix it. Now to figure out why.

@smolt
Copy link
Member

smolt commented Feb 1, 2016

Back to the puzzle. @joakim-noah as you noticed:

The problem appears to be that HANDLER_FRAME is never set in the ARM _d_eh_personality function because ucb.barrier_cache.sp always returns zero.

It seems it is the personality's responsibility to set sp and something like the following is missing from ldc implementation.

https://github.com/gcc-mirror/gcc/blob/bd3f0a53c07086e978ea4ec00e10a20e4a969c40/libobjc/exception.c#L467

I am trying to run entire test suite on Pi, see if it can do it now, but it is very S L O W. I like it.

@smolt
Copy link
Member

smolt commented Feb 1, 2016

Test suite run from Pi last night didn't do very well. I have a collection of changes, plus now a list of failures to sift through. I'll make a PR later this week for EH once I confirm it isn't part of the failures, and will also start a separate issue for getting ldc into shape on Raspberry Pi.

@redstar
Copy link
Member

redstar commented Feb 1, 2016

I reactivated my IFC6410. It is a Krait-based ARM board, using triple armv7a-hardfloat-linux-gnueabi.
The test suite is badly broken, see the output here: https://gist.github.com/redstar/f4fbc682e03471064fbe
Runtime of test suite is 2h30min.

@smolt
Copy link
Member

smolt commented Feb 1, 2016

Yeah, looks familiar. All the debug failures probably are due to the extra ldr instruction at -O0. I wrote something on that in the dlang LDC forum.

Sent from my iPhone

On Feb 1, 2016, at 11:57 AM, Kai Nacke [email protected] wrote:

I reactivated my IFC6410. It is a Krait-based ARM board, using triple armv7a-hardfloat-linux-gnueabi.
The test suite is badly broken, see the output here: https://gist.github.com/redstar/f4fbc682e03471064fbe
Runtime of test suite is 2h30min.


Reply to this email directly or view it on GitHub.

@redstar
Copy link
Member

redstar commented Feb 1, 2016

Yes, I read that. Good analysis.

@smolt
Copy link
Member

smolt commented Feb 2, 2016

Exceptions seem ok now. Should we start a separate issue or issues to work through remaining ARM gnueabi problems? Next thing I will work on is the C ABI. I have a more exhaustive test I developed for iOS that found many problems on Pi when calling or being called from C functions. In the end we should have an abi-arm.cpp and I'll put my additional tests in the test suite.

edit - not saying exception handling is totally done. Some of the remaining test suite or unittest failures may be related, just haven't looked close yet.

@dnadlinger
Copy link
Member Author

In the end we should have an abi-arm.cpp

Definitely! abi.cpp is just intended as a no-op default implementation to use on a new platform until you discover which parts are broken. ;)

and I'll put my additional tests in the test suite.

Please do. Improved C ABI coverage is definitely very helpful for bringing up LDC on a new platform.

@smolt
Copy link
Member

smolt commented Feb 4, 2016

Following the AAPCS made a huge difference! I already had something close for watchOS which is AAPCS-16. I am working on cleaning it up for a proper PR, hope to have it soon. The hacked up stuff I ran tests on is in my smolt:pi play branch.

Look at the Total Test time. Yes, it really takes that long to run tests on a Pi, partially because it only has 512 MB and it often needs to use swap which I have setup on a USB 1 TB HD. Just linking phobos test runner takes 1.5 hours.

99% tests passed, 9 tests failed out of 673

Total Test time (real) = 67992.27 sec

The following tests FAILED:
    234 - std.exception (OTHER_FAULT)
    242 - std.datetime (OTHER_FAULT)
    263 - std.random (Failed)
    307 - std.internal.math.errorfunction (Failed)
    572 - std.datetime-debug (OTHER_FAULT)
    593 - std.random-debug (Failed)
    670 - dmd-testsuite-debug (Failed)
    672 - dmd-testsuite (Failed)
    673 - llvm-ir-testsuite (Failed)

@joakim-noah
Copy link
Contributor

Following the AAPCS made a huge difference! I already had something close for watchOS which is AAPCS-16. I am working on cleaning it up for a proper PR, hope to have it soon. The hacked up stuff I ran tests on is in my smolt:pi play branch.

Great. :)

Look at the Total Test time. Yes, it really takes that long to run tests on a Pi, partially because it only has 512 MB and it often needs to use swap which I have setup on a USB 1 TB HD. Just linking phobos test runner takes 1.5 hours.

Couldn't you cross-compile, as I did a couple summers ago, and cross-link, save some time that way? You'd lose some of the automation from the build scripts, but it's not hard to run the test runner yourself.

99% tests passed, 9 tests failed out of 673

Total Test time (real) = 67992.27 sec

The following tests FAILED:
    234 - std.exception (OTHER_FAULT)
    242 - std.datetime (OTHER_FAULT)
    263 - std.random (Failed)
    307 - std.internal.math.errorfunction (Failed)
    572 - std.datetime-debug (OTHER_FAULT)
    593 - std.random-debug (Failed)
    670 - dmd-testsuite-debug (Failed)
    672 - dmd-testsuite (Failed)
    673 - llvm-ir-testsuite (Failed)

Could the std.exception, std.random, and std.internal.math.errorfunction failures be related to the optimization issues I saw on Android too? Those flags worked around those issues for me.

@smolt
Copy link
Member

smolt commented Feb 4, 2016

Couldn't you cross-compile

Yeah, but in this case I get dmd test suite run too - which has 33 failures right now. It's not so bad, it only takes long to run and rebuild everything and then I am off doing other stuff. Compile/link/run for a single test program is on order of tens of seconds.

Could the std.exception, std.random, and std.internal.math.errorfunction failures be related to the optimization issues I saw on Android too? Those flags worked around those issues for me.

Might be. I'll try what you did first to see if problem is related. But then I'd like to find cause and fix that.

I am curious. Is Android following AAPCS? Did you use something beyond default abi.cpp? I don't recall you mentioning that, yet using the default abi.cpp and core.std.varags doesn't work well for anything I've done on ARM so far: iOS - APCS, watchOS - AAPCS-16, and now arm-linux-gnueabihf - AAPCS.

@joakim-noah
Copy link
Contributor

Yeah, but in this case I get dmd test suite run too - which has 33 failures right now.

That's one advantage of natively compiling on Android/ARM too: I should be able to get the compiler testsuite running easily, though I haven't tried it yet.

It's not so bad, it only takes long to run and rebuild everything and then I am off doing other stuff. Compile/link/run for a single test program is on order of tens of seconds.

Heh, "tens of seconds." I'd have thought the USB HD would better that, but I guess the CPU in the RPi is pretty weak too.

I am curious. Is Android following AAPCS? Did you use something beyond default abi.cpp? I don't recall you mentioning that, yet using the default abi.cpp and core.std.varags doesn't work well for anything I've done on ARM so far: iOS - APCS, watchOS - AAPCS-16, and now arm-linux-gnueabihf - AAPCS

I don't know if it's AAPCS. :) I use the default abi.cpp with the struct passing change you suggested, that's it. There are definitely some issues with varargs, which I have noticed with small test programs I wrote, but not in the druntime/phobos tests. Maybe that's tested by the dmd test suite, which I haven't run on Android yet.

@smolt
Copy link
Member

smolt commented Feb 4, 2016

I don't know if it's AAPCS. :) I use the default abi.cpp with the struct passing change you suggested, that's it.

Oh, forgot about how using "byval" arg passing with ARM can produce bad instruction sequences. AAPCS does not use "byval", so that is probably why most of the tests passed. However, full AAPCS compatibility allows structs passing/returns to be compatible with C, and may fix Android vararg problem. There is some special treatment for 8-byte aligned types and a couple other things. I'll get it out in next few days so you can try before it is pulled.

@joakim-noah
Copy link
Contributor

OK, I'll try it.

@smolt
Copy link
Member

smolt commented Feb 4, 2016

Although it is messy code, you could try this: 4e6778b together with ldc-developers/druntime#58 on Android. Or you could wait until I cleanup the code and make PR. It mixes iOS APCS and watchOS AAPCS-16 ABI code and has lots of vestigial parts. It does seem to work well with C ABI on ARM though.

@dnadlinger
Copy link
Member Author

19 hours?! At that rate, wouldn't it even be faster to just run it in qemu?

@smolt
Copy link
Member

smolt commented Feb 4, 2016

On Feb 4, 2016, at 9:25 AM, David Nadlinger [email protected] wrote:

19 hours?! At that rate, wouldn't it even be faster to just run it in qemu?

Maybe, but long builds aren't a problem because I am away most of day. I spend a couple hours before sleep, kick off full build, then I am not back at until 22 hrs later.

@joakim-noah
Copy link
Contributor

I tried out the commits you suggested. The good news is all druntime/phobos tests still pass on Android/ARM, the bad news is that I don't have my vararg test case anymore and can't seem to reproduce it. So hopefully you fixed the issue, but I won't know for a bit.

@smolt
Copy link
Member

smolt commented Feb 6, 2016

When I get new C ABI test in test suite you can try that. It has some of vararg tests.

@smolt
Copy link
Member

smolt commented Feb 27, 2016

Heads up that this may be closed soon. Just need ldc-developers/druntime#62 in AND someone besides me to verify it fixes all exception handling failures in the test-suite.

@joakim-noah
Copy link
Contributor

Now that that druntime pull is in and partially verified, I think this can be closed.

There is a bounty on this issue: @smolt should get it, if it's still available on bountysource.

@smolt
Copy link
Member

smolt commented Mar 5, 2016

If there is a bounty, I only worked on the last little pieces. David wrote most of the code I think. If no wishes to claim it, I say donate it to the D Foundation.

@joakim-noah
Copy link
Contributor

@klickverbot, you can claim the bounty and close this issue, since you wrote most of the EH code, or @andralex can transfer it to the D Foundation (or who knows? Maybe FB won't pay it out anymore and we can just close the issue 🎱).

@smolt
Copy link
Member

smolt commented Mar 25, 2016

I hope bounty discussion isn't keeping this open? I think we should advertise that exception handling is good by closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants