[build] Mass-update JLL versions to get codesigned libraries on Darwin #39719

Merged
merged 3 commits into from
Feb 24, 2021

Conversation

staticfloat
Member

On Apple Silicon, we need ad-hoc code signatures on all libraries. This
mass rebuild of the dependencies provides signed versions of all
libraries on both x86_64-apple-darwin and aarch64-apple-darwin

@omus
Member

omus commented Feb 18, 2021

I'll need to look over all of the changes to the JLLs but the changes here look good

@staticfloat
Member Author

It's still failing tests so there's likely something wrong. I will try to fix it today.

@staticfloat
Member Author

There's an issue with OpenBLAS on FreeBSD and macOS. For some reason, when OpenBLAS initializes itself for the HASWELL target and calls gotoblas->init() here, it jumps to the wrong function: instead of init, it ends up in ztrsm_olnncopy.

I have traced the trigger to the recent change that sets TARGET=HASWELL as the default architecture, but I'm not sure whether that is illegal in the OpenBLAS world or just unsupported on 0.3.10. Investigating.

staticfloat added a commit to JuliaPackaging/Yggdrasil that referenced this pull request Feb 19, 2021
Turns out that OpenBLAS 0.3.12 and earlier have some kind of
`clang`-related code generation bug when built like this.

The issue is that the call to `gotoblas->init()` here [0] does not
actually call `init` but instead calls `ztrsm_olnncopy`, presumably
because of some struct layout issue.  I briefly looked into it, but
since it only affects `clang` and doesn't affect `0.3.13`, I'm taking
the coward's way out and just not poking the compiler bear.

[0]
https://github.com/xianyi/OpenBLAS/blob/v0.3.10/driver/others/dynamic.c#L921

X-ref: JuliaLang/julia#39719
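
To illustrate the kind of failure the commit message guesses at, here is a minimal sketch of how a struct layout disagreement can make a call through one function-pointer slot land in a different function. It is not OpenBLAS code and does not claim to be the actual cause: the types `table_built` and `table_assumed` and the functions `demo_init` and `demo_copy` are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for two entries of a dispatch table. */
static const char *demo_init(void) { return "init"; }
static const char *demo_copy(void) { return "copy"; }

typedef const char *(*demo_fn)(void);

/* Layout the table was actually built with. */
struct table_built {
    int     flags;
    demo_fn copy;
    demo_fn init;
};

/* Layout the caller believes in: the two slots are swapped. */
struct table_assumed {
    int     flags;
    demo_fn init;
    demo_fn copy;
};

/* Load whatever pointer sits at the offset where the caller expects `init`. */
static demo_fn load_assumed_init(const struct table_built *t)
{
    demo_fn fn;
    memcpy(&fn,
           (const unsigned char *)t + offsetof(struct table_assumed, init),
           sizeof fn);
    return fn;
}

/* "Calls init" through the assumed layout and reports which function ran. */
static const char *call_init_via_assumed_layout(void)
{
    static const struct table_built table = { 0, demo_copy, demo_init };

    /* Same size and member types, so only the slot order differs. */
    assert(sizeof(struct table_built) == sizeof(struct table_assumed));
    return load_assumed_init(&table)();
}
```

Calling `call_init_via_assumed_layout()` runs `demo_copy` rather than `demo_init`, because the offset the caller computes for `init` is the offset the table stores `copy` at. In a real library such a disagreement would more plausibly come from conditionally compiled members than from a literal swap.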
@staticfloat
Member Author

Looks like that was the only issue!

@staticfloat staticfloat added the backport 1.6 Change should be backported to release-1.6 label Feb 20, 2021
Member

@omus omus left a comment


Also, there's an empty file deps/checksums/llvm-tools

deps/checksums/unwind
@staticfloat staticfloat force-pushed the sf/codesigned_libraries branch 2 times, most recently from bc376db to 0aea071 Compare February 20, 2021 03:52
@staticfloat
Member Author

@Keno building this branch on Apple Silicon (with this [0] patch applied) ends up taking this branch and I get the following error during bootstrap:

LoadError(at "sysimg.jl" line 3: LoadError(at "Base.jl" line 206: LoadError(at "shell.jl" line 255: LoadError(at "shell.jl" line 335: ErrorException("PCRE JIT error: no more memory")))))

Could this be indicative that our PCRE patch is insufficient? I hear that this does not happen to @omus.

@Keno
Member

Keno commented Feb 20, 2021

Could there be some entitlement for JIT permissions that gets turned on when you codesign it?

@staticfloat
Member Author

I checked for entitlements, but it claims this library has none:

$ codesign -d --entitlements :- usr/lib/libpcre2-8.dylib 
Executable=/Users/sabae/src/julia/usr/lib/libpcre2-8.0.dylib
usr/lib/libpcre2-8.dylib: no signature
$ codesign -dv --verbose=4 usr/lib/libpcre2-8.dylib 
Executable=/Users/sabae/src/julia/usr/lib/libpcre2-8.0.dylib
Identifier=libpcre2-8.0.dylib
Format=Mach-O thin (arm64)
CodeDirectory v=20400 size=5803 flags=0x0(none) hashes=176+2 location=embedded
VersionPlatform=1
VersionMin=720896
VersionSDK=1310720
Hash type=sha256 size=32
CandidateCDHash sha1=a4fb402562f1a040c845c2b19e48809259f8a2c3
CandidateCDHashFull sha1=a4fb402562f1a040c845c2b19e48809259f8a2c3
CandidateCDHash sha256=502242e818d13c3389de129b52b42277794c5494
CandidateCDHashFull sha256=502242e818d13c3389de129b52b42277794c5494c386cfac348323d5f98354e0
Hash choices=sha1,sha256
CMSDigest=5488537125f142fc4ec0afa3e7bf6808e84650a5b89bd4e3224599be1b09a686
CMSDigestType=2
Page size=4096
CDHash=502242e818d13c3389de129b52b42277794c5494
usr/lib/libpcre2-8.dylib: no signature
Info.plist=not bound
TeamIdentifier=not set
Sealed Resources=none
Internal requirements count=1 size=136

@Keno
Member

Keno commented Feb 20, 2021

Right, I'm suggesting it might need them if codesign is turned on

@staticfloat
Member Author

I don't think so; these things aren't codesigned in the normal way, this is just the signing that the linker does to every executable. I don't have any entitlements, and I haven't opted into the hardened runtime. I can work around it by disabling the PCRE2 JIT but that's the only way. Compiling locally has the same problem.

@Keno
Member

Keno commented Feb 20, 2021

Well, the error suggests that a MAP_JIT mmap was rejected, which Darwin does like to do. No idea why it would happen if the hardened runtime isn't used though.

@KristofferC
Member

This seems a bit big to backport, and the discussion about the various problems makes me think we should put this into 1.6.1 at the earliest. 1.6.0 is really at a crucial bugfix-only stage at this point.

@KristofferC KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Feb 20, 2021
@staticfloat
Member Author

I agree with Kristoffer. We’ll have a special 1.6.1-alpha for aarch64-Darwin after 1.6.0 final is tagged.

@omus
Member

omus commented Feb 21, 2021

I hear that this does not happen to @omus.

I disabled SIP on my system a while ago to get lldb working (I forgot about this). I bet that enabling it again will cause the PCRE JIT issue

@omus
Member

omus commented Feb 21, 2021

Seems like re-enabling SIP still does not cause the PCRE JIT issue.

@staticfloat
Member Author

staticfloat commented Feb 21, 2021

Well, the error suggests that a MAP_JIT mmap was rejected, which Darwin does like to do. No idea why it would happen if the hardened runtime isn't used though.

Building a debug pcre2 and stepping through with lldb, it appears that the mmap() works fine, but the mprotect() doesn't. The relevant code (after patching and removing dead #ifdef branches) is:

static SLJIT_INLINE void* alloc_chunk(sljit_uw size)
{
        void *retval;
        const int prot = PROT_READ | PROT_WRITE | PROT_EXEC;

        int flags = MAP_PRIVATE | MAP_ANON | SLJIT_MAP_JIT;

        retval = mmap(NULL, size, prot, flags, -1, 0);
        if (retval == MAP_FAILED)
                return NULL;

        if (mprotect(retval, size, prot) < 0) {
                munmap(retval, size);
                return NULL;
        }

        SLJIT_UPDATE_WX_FLAGS(retval, (uint8_t *)retval + size, 0);

        return retval;
}

We get to the mprotect() call, with exemplar values:

Process 97774 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x0000000106e4e8ec libpcre2-8.0.dylib`alloc_chunk(size=65536) at sljitExecAllocator.c:185:15
   182          if (retval == MAP_FAILED)
   183                  return NULL;
   184 
-> 185          if (mprotect(retval, size, prot) < 0) {
   186                  munmap(retval, size);
   187                  return NULL;
   188          }
Target 0: (julia) stopped.
(lldb) p retval
(void *) $17 = 0x0000000104f84000
(lldb) p size
(sljit_uw) $19 = 65536
(lldb) p prot
(const int) $20 = 7

mprotect() then returns -1 and sets errno to EACCES. Digging into XNU, this happens when mach_vm_protect() returns KERN_PROTECTION_FAILURE, which would be what is happening here. But this is the end of my journey; I'm not sure why one of these branches is failing (or even if this is what's happening, since I can't debug the kernel).
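
For anyone who wants to reproduce the errno check without a debugger, a minimal sketch along these lines may help. It is a standalone probe, not part of pcre2 or SLJIT; the helper names `try_protect` and `map_rw` are made up for this example, and it maps plain anonymous memory without MAP_JIT, so it only approximates the patched allocator shown above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Returns 0 on success, otherwise the errno value that mprotect() set. */
static int try_protect(void *addr, size_t len, int prot)
{
    assert(len > 0);
    errno = 0;
    if (mprotect(addr, len, prot) == 0)
        return 0;
    return errno;
}

/* Maps an anonymous read/write region; returns NULL on failure. */
static void *map_rw(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

With a region from `map_rw(65536)`, comparing the result of `try_protect(p, 65536, PROT_READ | PROT_WRITE | PROT_EXEC)` against `EACCES` shows whether the system rejects the rwx request in the same way as reported here; on other systems the call may simply succeed.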

Just to be sure, I re-codesigned everything with explicit JIT entitlements:

$ codesign --force --entitlements ./contrib/mac/app/Entitlements.plist --sign - ./usr/lib/*.dylib ./usr/bin/julia
$ codesign -d --entitlements :- ./usr/bin/julia 
Executable=/Users/sabae/src/julia/usr/bin/julia
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>com.apple.security.automation.apple-events</key>
        <true/>
        <key>com.apple.security.cs.get-task-allow</key>
        <true/>
        <key>com.apple.security.cs.allow-dyld-environment-variables</key>
        <true/>
        <key>com.apple.security.cs.allow-jit</key>
        <true/>
        <key>com.apple.security.cs.allow-unsigned-executable-memory</key>
        <true/>
        <key>com.apple.security.cs.debugger</key>
        <true/>
        <key>com.apple.security.cs.disable-library-validation</key>
        <true/>
        <key>com.apple.security.device.audio-input</key>
        <true/>
        <key>com.apple.security.device.camera</key>
        <true/>
</dict>
</plist>

But this doesn't improve anything; I still get the same error.

@staticfloat
Member Author

Aha; looks like the issue is that I'm running macOS 11.2: zherczeg/sljit#99

@staticfloat
Member Author

Okay, so the fundamental issue is that it's illegal to mprotect() a page to rwx on aarch64, and the fact that it worked before 11.2 was a bug, according to Apple in this comment. I guess SLJIT needs a new patch so that it no longer expects rwx to be possible and instead flip-flops between rw- and r-x correctly. A quick test of setting that initial prot to either r-x or rw- yields segfaults later on in pcre2_match_8, so it's not just the test that is broken; the actual usage is broken somehow too. :(
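
The rw-/r-x flip-flop can be sketched roughly as below. This is only an illustration of the idea using plain mmap() and mprotect(); it is not the SLJIT patch, and the `wx_region` type and `wx_*` functions are hypothetical names. It also copies bytes without ever executing them, so it leaves out things a real JIT needs, such as instruction cache maintenance.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANON on glibc under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* A region that is either writable or executable, never both at once. */
struct wx_region {
    unsigned char *base;
    size_t         len;
};

/* Allocate the region in the rw- state. Returns 0 on success, -1 on failure. */
static int wx_alloc(struct wx_region *r, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    r->len = len;
    return 0;
}

static int wx_make_writable(struct wx_region *r)
{
    return mprotect(r->base, r->len, PROT_READ | PROT_WRITE);
}

static int wx_make_executable(struct wx_region *r)
{
    return mprotect(r->base, r->len, PROT_READ | PROT_EXEC);
}

/* Copy bytes in while the region is rw-, then flip it to r-x. */
static int wx_emit(struct wx_region *r, const void *code, size_t n)
{
    assert(n <= r->len);
    if (wx_make_writable(r) != 0)
        return -1;
    memcpy(r->base, code, n);
    return wx_make_executable(r);
}

static void wx_free(struct wx_region *r)
{
    munmap(r->base, r->len);
    r->base = NULL;
    r->len = 0;
}
```

The point of the design is that no call ever requests PROT_WRITE and PROT_EXEC together. On Apple Silicon the mechanism Apple documents for JIT memory is MAP_JIT combined with pthread_jit_write_protect_np(), so whether plain mprotect() toggling like this is accepted there is an assumption of the sketch, not something established in this thread.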

@Keno
Member

Keno commented Feb 22, 2021

Looks like there's already a patch for this: zherczeg/sljit#105

@omus omus mentioned this pull request Feb 22, 2021
31 tasks
@staticfloat staticfloat force-pushed the sf/codesigned_libraries branch 2 times, most recently from bca401c to aa8ded0 Compare February 22, 2021 19:20
@Keno
Member

Keno commented Feb 22, 2021

Looks like something is wrong with the openblas aarch64 build?

@staticfloat
Member Author

This now builds cleanly on Apple silicon for me, with the exception of the patch needed to comment out the WRITE_FAULT branch in src/signals-mach.c.

@staticfloat
Member Author

I plan to merge this when green.

@omus
Member

omus commented Feb 24, 2021

Interesting failure. Looks unrelated

@Keno
Member

Keno commented Feb 24, 2021

CI had a hiccup; I've restarted it.

@staticfloat staticfloat merged commit e32e94e into master Feb 24, 2021
@staticfloat staticfloat deleted the sf/codesigned_libraries branch February 24, 2021 05:09

4 participants