Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD code selection and configury is incorrect for multi-arch compilation #76

Closed
jmarshall opened this issue Mar 9, 2023 · 9 comments · Fixed by #78
Closed

SIMD code selection and configury is incorrect for multi-arch compilation #76

jmarshall opened this issue Mar 9, 2023 · 9 comments · Fixed by #78

Comments

@jmarshall
Copy link
Member

[This investigation has occurred in a pysam context, which uses htslib's bundled-htscodecs build system rather than htscodecs's native build system. So this issue might equally properly be raised in htslib, but I expect any fix will span both htscodecs and htslib anyway.]

It seems that on modern macOS the usual way to build universal object files and executables is to compile and link with -arch x86_64 -arch arm64. This runs the compiler twice, once for each target, and then runs lipo to combine the two temporary .o files into a single multi-arch (‘universal’) .o file.

This does not interact well with htslib's configure.ac SIMD probes (and presumably something similar happens with htscodecs's configure.ac):

# CFLAGS=[…] -arch arm64 -arch x86_64
[…]
checking whether C compiler accepts -mssse3 -mpopcnt -msse4.1... no
checking whether C compiler accepts -mavx2... no
checking whether C compiler accepts -mavx512f... no
checking whether C compiler supports ARM Neon... no

Looking in config.log we see that the underlying x86_64 compilation has accepted -mssse3 etc but the underlying arm64 compilation has produced errors, so the overall compilation fails too — resulting in no. And vice versa for the Neon probe.

This results in rANS_static32x16pr_neon.o et al being omitted from the build, but the *_neon() invocations in rANS_static4x16pr.c are included independent of the configure answer. This in turn leads to linking failure or for shared objects a failure to find rans_compress_O0_32x16_neon at runtime, as encountered in pysam-developers/pysam#1149.

It would be simple enough to fix the *_neon() invocation guard to respect the no answer and avoid this crash. However this pessimizes execution time for this multi-arch build by omitting all the SIMD implementations — really for this universal ARM/X86-64 build the answer to all four checks should be yes, with the 3 x86 SIMD implementations being used for the underlying x86_64 build and the Neon implementation being used for the underlying ARM build.

@jmarshall
Copy link
Member Author

Over in pysam I have been experimenting with fixing this as follows:

  1. In htslib/configure.ac, wrap #if defined __i386__ || defined __x86_64__/#endif around both the #include "x86intrin.h" and the _mm_*() code so that the underlying ARM compilation won't break the probe test compilation and so the check says yes for multi-arch builds. (And similarly with #if defined __arm__ || defined __aarch64__ for the Neon probe.)
  2. In htscodecs/rANS_static32x16pr_*.o wrap the whole files in #if defined __i386__ || defined __x86_64__/#endif similarly.

This pretty much fixes the issue for this multi-arch build. However it also means that e.g. a SPARC build will also say yes to all four probes. Ideally there would be a way for configure.ac to detect this multi-arch scenario and only be generous if there exists a x86 (respectively ARM) target in the mix, but this is non-trivial.

Another possibility would be to have configure.ac grep $CFLAGS for -arch options…

@daviesrob
Copy link
Member

That's annoying. It reminds me of the old x86-64/PPC fat-binary days which had much the same problem.

@jmarshall
Copy link
Member Author

Annoying indeed.

Assuming the linker can deal with a mixture of universal and single-arch object files, maybe this is a better approach:

  • Write m4/shell functionality to filter out -arch xyz options.

  • Use it to build up CFLAGS_FOR_X86 and CFLAGS_FOR_ARM. Default them to $(CFLAGS) in the non-macOS or non-multiarch cases.

  • Use these instead of CFLAGS for the configure probes and SIMD implementation translation unit rules.

@jkbonfield
Copy link
Collaborator

@daviesrob Given I don't have a Mac and our local policy of not providing a viable remote login to a Mac for software development (despite repeated requests), this is simply something I am unable to experiment with and verify.

Is it something you could test? Either that or I can start exploring AWS. I could only find x86 macs, but maybe the arm ones are in another AWS region as I'm sure they do exist somewhere. Either that or helpdesk tickets to get loan machines (and probably many more to get them to install all the required software for me).

@daviesrob
Copy link
Member

We have plenty of suitable macs as a group. Ideally I'd like to avoid extracting -arch options as it may be fragile. It may also be unavoidable, though. I'll experiment.

@daviesrob
Copy link
Member

Argghhhh ... it's all coming back to me now. These multiarch binaries don't play well with autoconf because configure expects one answer to its questions, but the multiarch compilation tries to give two.

I think wrapping the tests and SIMD usage in #ifdef guards is the solution. I don't think the tests saying yes on platforms like SPARC will actually be a problem, because they're really checking to see if the compiler accepts the option and a SPARC-only one will probably not do that.

We'll also need to get rid of the conditional compilation, which will mean finding another solution for the empty translation unit problem. That shouldn't be too difficult to solve though, apart from the minor annoyance of possibly leaving a few dummy symbols around.

@jmarshall
Copy link
Member Author

FYI I have now tried the “filter out -arch xyz options” approach — see pysam-developers/pysam#1173. This works well enough for pysam's needs and we will probably apply it in the interim until there's an htslib/htscodecs release containing Rob's eventual PR, but it's not great:

  • The static build doesn't like it:
ar -rc libhts.a kfunc.o kstring.o bcf_sr_sort.o bgzf.o errmod.o faidx.o header.o hfile.o hts.o hts_expr.o hts_os.o md5.o multipart.o probaln.o realn.o regidx.o region.o sam.o synced_bcf_reader.o vcf_sweep.o tbx.o textutils.o thread_pool.o vcf.o vcfutils.o cram/cram_codecs.o cram/cram_decode.o cram/cram_encode.o cram/cram_external.o cram/cram_index.o cram/cram_io.o cram/cram_stats.o cram/mFILE.o cram/open_trace_file.o cram/pooled_alloc.o cram/string_alloc.o htscodecs/htscodecs/arith_dynamic.o htscodecs/htscodecs/fqzcomp_qual.o htscodecs/htscodecs/htscodecs.o htscodecs/htscodecs/pack.o htscodecs/htscodecs/rANS_static4x16pr.o htscodecs/htscodecs/rANS_static32x16pr_avx2.o htscodecs/htscodecs/rANS_static32x16pr_avx512.o htscodecs/htscodecs/rANS_static32x16pr_sse4.o htscodecs/htscodecs/rANS_static32x16pr_neon.o htscodecs/htscodecs/rANS_static32x16pr.o htscodecs/htscodecs/rANS_static.o htscodecs/htscodecs/rle.o htscodecs/htscodecs/tokenise_name3.o htscodecs/htscodecs/utils.o   hfile_libcurl.o hfile_gcs.o hfile_s3.o hfile_s3_write.o
ranlib: archive member: libhts.a(rANS_static32x16pr_neon.o) cputype (16777228) does not match previous archive members cputype (16777223) (all members must match)
ranlib libhts.a
ranlib: archive member: libhts.a(rANS_static32x16pr_neon.o) cputype (16777228) does not match previous archive members cputype (16777223) (all members must match)
make: [libhts.a] Error 1 (ignored)

For pysam's purposes we just build this so that all the .o files are built, so this doesn't matter too much for us.

  • The linker produces warnings even when building a shared object:
clang -undefined dynamic_lookup -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX13.sdk -arch arm64 -arch x86_64 -g -Wno-unused-result -Wno-unreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g build/temp.macosx-13-arm64-cpython-310/pysam/htslib_util.o build/temp.macosx-13-arm64-cpython-310/pysam/libchtslib.o htslib/kfunc.o htslib/kstring.o htslib/bcf_sr_sort.o htslib/bgzf.o htslib/errmod.o htslib/faidx.o htslib/header.o htslib/hfile.o htslib/hts.o htslib/hts_expr.o htslib/hts_os.o htslib/md5.o htslib/multipart.o htslib/probaln.o htslib/realn.o htslib/regidx.o htslib/region.o htslib/sam.o htslib/synced_bcf_reader.o htslib/vcf_sweep.o htslib/tbx.o htslib/textutils.o htslib/thread_pool.o htslib/vcf.o htslib/vcfutils.o htslib/cram/cram_codecs.o htslib/cram/cram_decode.o htslib/cram/cram_encode.o htslib/cram/cram_external.o htslib/cram/cram_index.o htslib/cram/cram_io.o htslib/cram/cram_stats.o htslib/cram/mFILE.o htslib/cram/open_trace_file.o htslib/cram/pooled_alloc.o htslib/cram/string_alloc.o htslib/htscodecs/htscodecs/arith_dynamic.o htslib/htscodecs/htscodecs/fqzcomp_qual.o htslib/htscodecs/htscodecs/htscodecs.o htslib/htscodecs/htscodecs/pack.o htslib/htscodecs/htscodecs/rANS_static4x16pr.o htslib/htscodecs/htscodecs/rANS_static32x16pr_avx2.o htslib/htscodecs/htscodecs/rANS_static32x16pr_avx512.o htslib/htscodecs/htscodecs/rANS_static32x16pr_sse4.o htslib/htscodecs/htscodecs/rANS_static32x16pr_neon.o htslib/htscodecs/htscodecs/rANS_static32x16pr.o htslib/htscodecs/htscodecs/rANS_static.o htslib/htscodecs/htscodecs/rle.o htslib/htscodecs/htscodecs/tokenise_name3.o htslib/htscodecs/htscodecs/utils.o htslib/hfile_libcurl.o htslib/hfile_gcs.o htslib/hfile_s3.o htslib/hfile_s3_write.o -L/private/var/folders/fg/b67mbhkx2_nflv_fs566s9080000gn/T/cirrus-ci-build/pysam -L/private/var/folders/fg/b67mbhkx2_nflv_fs566s9080000gn/T/cirrus-ci-build -Lbuild/lib.macosx-13-arm64-cpython-310/pysam -lz -llzma -lbz2 -lz -lcurl -o build/lib.macosx-13-arm64-cpython-310/pysam/libchtslib.cpython-310-darwin.so -dynamiclib -rpath @loader_path -Wl,-headerpad_max_install_names -Wl,-install_name,@rpath/libchtslib.cpython-310-darwin.so -Wl,-x
ld: warning: -undefined dynamic_lookup may not work with chained fixups
ld: warning: ignoring file htslib/htscodecs/htscodecs/rANS_static32x16pr_avx2.o, building for macOS-arm64 but attempting to link with file built for unknown-x86_64
ld: warning: ignoring file htslib/htscodecs/htscodecs/rANS_static32x16pr_sse4.o, building for macOS-arm64 but attempting to link with file built for unknown-x86_64
ld: warning: ignoring file htslib/htscodecs/htscodecs/rANS_static32x16pr_avx512.o, building for macOS-arm64 but attempting to link with file built for unknown-x86_64
ld: warning: -undefined dynamic_lookup may not work with chained fixups
ld: warning: ignoring file htslib/htscodecs/htscodecs/rANS_static32x16pr_neon.o, building for macOS-x86_64 but attempting to link with file built for unknown-arm64

Nonetheless the two halves of the resulting libchtslib.cpython-310-darwin.so do indeed contain their respective SIMD implementation translation units, so this is good enough for pysam in the interim.

Full build log at the m1-ci branch at jmarshall/pysam and cirrus.ci/…/test.log.

@jkbonfield: This also demonstrates a (somewhat inconvenient) test methodology: Set up a Cirrus CI Apple Silicon build job.

@jkbonfield
Copy link
Collaborator

@daviesrob 's PR #78 already includes a cirrus-CI modificatoin to do multi-arch building and test it. I'm hoping it's sufficient.

I'm not sure what you mean by "filter out", but I assume it's in relation to Rob's comments on how his PR works, where the code if ifdefed out rather than the Makefile selecting or deselecting specific files. Given we have a single configure line, a single Makefile, but two architectures, it feels like we must compile every object file and just cope with the fact that some aren't appropriate by them becoming minimalist when the detected CPU doesn't match. I don't really see how else to solve this given the constraints of how multi-arch works.

Open to suggestions though.

@jmarshall
Copy link
Member Author

“Filter out” is a reference to #76 (comment): the pysam PR literally removes -arch arm64 from CFLAGS for the AVX object files and removes -arch x86_64 from CFLAGS for the Neon object file. It works, but with some niggles.

That approach solves pysam's problem in the current htslib release, but I think Rob's PR's approach will be better in general.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants