make check fails when compiling for gpu #178
Thank you for the report. Could you post the contents of test-suite.log so we can dig a bit deeper? Yes, we plan to use meson to build for GPU as well, with a view to making it the default build system; @amontoison is currently looking into this. |
Oh, it just can't find libcoinmetis.so.2:
Can confirm that lib is where I would have expected it to be:
If it helps, my formal training is in mechanical engineering rather than CS, so there might be something very simple I'm missing. I just reinstalled ubuntu and followed the instructions I listed above again (i.e. if there's something simple that you are assuming competent people do with their ubuntus not listed above, I am not competent, and I did not do it). Also note that I have been running from spral's commit 5e8b409 since that was the last time COMPILE.md was modified, but I just tried again from spral's master branch, same result and log. Thanks! |
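For the libcoinmetis.so.2 error above, one way to see which libraries the dynamic loader cannot resolve is to filter the output of ldd. This is only a minimal sketch: the helper name missing_libs and the paths in the usage comments are placeholders, not anything from the report.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd-style output on stdin, so it also works on a saved log.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage sketch (paths are hypothetical):
#   ldd ./some_test_binary | missing_libs
# If a library is listed even though the file exists on disk, its directory
# is probably not on the loader search path. Two common ways to add it:
#   export LD_LIBRARY_PATH="/path/to/metis/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
#   sudo ldconfig    # after installing into a standard system directory
```

A library sitting in the expected directory is not enough on its own. The loader only searches the directories it has been told about, which is why LD_LIBRARY_PATH or ldconfig can make a difference here.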
Can you try with this precompiled metis? |
Following up on the error, it looked like LD_LIBRARY_PATH might help spral find metis. I switched to your precompiled metis. Then, with metisdir pointing to your precompiled metis,
I did
so
reconfigured a new checkout of spral with
The make check gets further:
I thought my installation of hwloc (apt-get hwloc libhwloc-dev) might now be holding me back because it wasn't compiled on my machine for gpus. I then tried
But then recompiling spral, the make fails to compile
Then I removed hwloc's repo and tried to reinstall hwloc and libhwloc-dev, but recompiling gives me the same bug. Something is messed up about my system, so I'll just do a fresh install. |
Yeah, I would use the precompiled dependencies that ship with ubuntu. For our CI tests on ubuntu we do:
and then setup as follows:
which works for us. We're working on getting a VM so that we can test GPU on the CI as well. |
But of course in your case you'll need to compile hwloc from source with CUDA support. I wouldn't use the development master branch for this though, version 2.8.0 of hwloc has worked for us in the past: |
Installing libmetis-dev and compiling hwloc from the v2.8 branch then calling ldconfig, then calling with the configure line above gets the make check tests to pass. I'm not sure if the tests check for gpu usage, but spamming nvidia-smi when the make check is running doesn't seem to show any gpu usage. One other issue is getting Ipopt to use spral, which I believe requires a .so shared library rather than the libspral.a. Following the directions on the readme:
gives me
which suggests some cuda code hasn't been compiled as a .so? I tried reconfiguring spral by adding -fPICs and flags for nvcc from previous instructions, and also the -lcudadevrt et al to the libs argument:
which configures fine, although I see some warnings including these, which seem worrisome considering I think code sm_86 should work for my 3080 gpu?
regardless, the code builds fine, but then make check fails with
with a test-suite.log of
But I am able to create a .so with:
From there I can also compile ipopt and get it to run with spral successfully. However, I can't seem to get it to ever use the gpu, regardless of the cpu/gpu weighting and min_work. One interesting thing is when I tried adding the cuda libs but not -fPIC:
Then make worked with fewer warnings about cuda codes: (sm_86 didn't get skipped anymore?)
make check succeeds, but creating the .so fails again, but with a different error?
Is there a way to use the libspral.a in Ipopt? Is there a way with just spral itself to test whether it is using the gpu? Should the gpu be used in make check? |
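On the sm_86 question above: an RTX 3080 has compute capability 8.6, which corresponds to sm_86 in nvcc. A small sketch for deriving that name from what the driver reports follows. The helper name cap_to_sm is made up for illustration, and the compute_cap query field is only available with reasonably recent drivers.

```shell
# Convert a compute capability such as "8.6" into the matching nvcc
# architecture name ("sm_86") by dropping the dot.
cap_to_sm() {
    printf 'sm_%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Usage sketch, assuming nvidia-smi supports the compute_cap query field:
#   cap_to_sm "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```

If the name printed here is not among the architectures passed to nvcc, the build may have skipped generating code for this card, which would be worth ruling out before looking elsewhere.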
If you compile SPRAL with OpenBLAS, you need different link flags for generating the shared library:
gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lopenblas -lhwloc -lmetis -lstdc++ -o libspral.so
If you use a GPU, you probably need to add additional flags:
gfortran -fPIC -shared -Wl,--whole-archive libspral.a -Wl,--no-whole-archive -lgomp -lopenblas -lhwloc -lmetis -lstdc++ -lcudadevrt -lcudart -lcuda -lcublas -o libspral.so
If you generate a shared library, you just need the link flag
For the question about |
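After linking a libspral.so with commands like the ones above, a quick sanity check is to confirm that the library actually exports the solver symbols and that its own dependencies resolve. The following is a rough sketch: the helper name defined_symbols is invented here, and the symbol name used in the comment is only illustrative.

```shell
# Filter nm-style output (address, type letter, name) on stdin and print
# the names of defined symbols that contain the given substring,
# compared case-insensitively. Undefined entries have only two fields
# and are skipped.
defined_symbols() {
    awk -v pat="$1" 'NF == 3 && index(tolower($3), tolower(pat)) > 0 { print $3 }'
}

# Usage sketch:
#   nm -D --defined-only libspral.so | defined_symbols ssids
#   ldd libspral.so | grep 'not found'    # unresolved dependencies, if any
# An empty result from the first command would suggest --whole-archive
# did not pull the objects in (e.g. no entries like spral_ssids_analyse).
```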
@johnmatt3 Can you try to compile SPRAL with Meson? |
Thanks for the update! I enlisted a real software buddy of mine (henceforth referred to as the linux sherpa) to help guide me through this attempt, so I got a bit further than usual. Note that I ran this on the code from a couple of weeks ago. TLDR: spral appears to be finding, but not using, my gpu (looks to my untrained eye like an issue with the guess_topology logic?). If I run as root and force the hwloc wrapper code to use my gpu, then ssidst loads something onto my gpu, but the tests error out (what looks like a combination of "fail residual" errors and a potential memory allocation issue). My configuration for this attempt:
CPU build is fine
The ssidst tests all seem to pass ok; total number of errors is 0.
Basic GPU build fails
in shell:
The build shows gpu: true and it finds various cuda libraries successfully.
Then the tests all pass but nvidia-smi shows no new process on the gpu and no apparent additional gpu usage beyond baseline. Looking in the code we added some printfs to see if the gpu is getting used, hwloc is getting found, etc.
Running "gpu build and test instructions", the configure and compilation appear successful. The tests yield a lot of
prints, but no HAVE_NVCC prints or any of the other printfs. This suggested hwloc was not finding my GPU and that HAVE_NVCC was false.
Forcing HAVE_NVCC to true, hwloc finds my GPU, but fails to select it for use?
We surmised that HAVE_NVCC should be true when -Dgpu=true, so we added this line at line 92 of meson.build:
Then rerunning "gpu build and test instructions", we get a lot of prints, the script ends like this:
Still no new process shows up in nvidia-smi, and no appreciable extra gpu usage is reported. Digging deeper (and this is where I definitely would have died on the mountain without a linux sherpa):
Running ssidst as root loads it onto the gpu (!), but random matrix test SIGABRTs, core dumps
The permission denieds suggested running as root (although we still configure and compile as a nonroot user)
which finally popped a new process up on nvidia-smi!
Turning off those prints (which are as expected given the gpu is being automatically returned regardless of numa node parentage), all tests are "ok", until:
or another run:
Not sure at this point if it's worth going too much further as presumably the hwloc numa node stuff wants to be fixed? However, in case you wanted more info here, running the build instructions with:
|
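Rather than spamming nvidia-smi by hand while the tests run, the check described above can be scripted by polling the list of compute processes and filtering for the test binary. This is a sketch under assumptions: the helper name filter_compute_apps is made up, and it relies on the ", " separated CSV that nvidia-smi prints for --query-compute-apps.

```shell
# Read "pid, process_name, used_memory" CSV lines on stdin and print
# only those whose process name contains the given substring.
filter_compute_apps() {
    awk -F', ' -v pat="$1" 'index($2, pat) > 0 { print }'
}

# Usage sketch: run in a second terminal while the tests execute.
#   while sleep 1; do
#       nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#                  --format=csv,noheader | filter_compute_apps ssidst
#   done
```

If nothing is ever printed while ssidst runs, no CUDA context was created by that process, which matches the "found but not used" behaviour reported in this thread.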
Many thanks for the very detailed bug report! It does indeed look like SPRAL is broken on GPU… @AndrewLister-STFC @haldaas @tyronerees what do you guys think? |
I am trying to install spral with gpu support (for eventual use in Ipopt). My hardware (and goal) is the same as a previous issue. Feel free to close this issue if the near-term plan is to use meson entirely moving forward in a way that will reliably build for use in Ipopt. I also tried the current meson build method; setting the gpu option resulted in an error because it didn't know how to compile .cu files. I added cuda to the project languages, but it couldn't find nvcc to compile with (nvcc is available on the command line). I presume this approach is getting fixed up based on this issue. Anyway, here are my results with the configure/make instructions I could find.
My software setup is
fresh install of ubuntu 22.04 LTS
Get cuda, from:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=18.04&target_type=deb_local
Not sure if this hwloc will let me have gpu support. Instructions suggest it must be installed from source but then I believe spral wouldn't even compile (using whatever the git clone of hwloc gave me)
Redo the cuda exports above.
Going to try to install coin-or's metis to hopefully get compatibility with Ipopt down the road?
(from https://github.com/ralna/spral/blob/master/COMPILE.md)
then compile spral
Compiles fine, make check results in:
Thanks!