
[WIP] Migrate away from centos7 #3

Closed · wants to merge 9 commits
Conversation


@cazlo cazlo commented Oct 18, 2024

What

  • migrate the rocm build to rockylinux 8, run a performance test, and observe that it still runs about the same as HEAD of main
  • migrate cuda build layers to rockylinux 8
  • migrate cpu build layers to rockylinux 8
  • try everything on rockylinux 9 (opting not to do this b/c the feedback loop between tests is too slow given current HW availability; rocky8 is good enough™ for now)
  • reopen against ollama/ollama if this generally works ok
    • remove all the rootless docker compose stuff added to facilitate the local testing. if this gets cleaned up and made generic, it would get its own PR against upstream
    • run as many unit tests on your HW as possible before submitting the PR to upstream

Why

closes ollama#7260

rocm rebuild is like 30 mins

not sure --link will work here

could probably avoid this if llama.cpp were a subtree instead of a submodule, avoiding the need to copy frequently-changing files like .git into the image build
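Vendoring llama.cpp as a subtree, as floated above, might look something like the following; the `llm/llama.cpp` prefix path is an assumption for illustration, not necessarily this repo's actual layout:

```shell
# Illustrative only: replace the submodule with a squashed subtree so the
# source lands directly in the repo (no .git plumbing needed at image build).
git submodule deinit llm/llama.cpp
git rm llm/llama.cpp
git subtree add --prefix llm/llama.cpp \
    https://github.com/ggerganov/llama.cpp master --squash
```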
it still compiles and runs ok on 7900xt

cazlo commented Oct 18, 2024

Notes:


cazlo commented Oct 18, 2024

rocky9 is going to be a little annoying b/c nvidia only publishes rocky9 images for cuda 12.3.0 and later.

the ollama build will use cuda libraries for 12.4.0 and 11.3.1

so we'd need to either build a rocky9 + cuda 11.3.1 image or amend the rh_linux_deps script to work for both rocky8 and rocky9 (currently it only works on rocky8)
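A minimal sketch of the second option (branching rh_linux_deps on the detected major release). The function names and toolset choices are illustrative, not the real script's:

```shell
#!/bin/sh
# Hedged sketch: pick a gcc toolset per Rocky/RHEL major version instead of
# assuming rocky8. In the real script the major would come from
# /etc/os-release's VERSION_ID.

os_major() {
  # $1: a VERSION_ID-style value, e.g. "8.9" or "9.3"
  echo "${1%%.*}"
}

pick_gcc_toolset() {
  case "$(os_major "$1")" in
    8) echo "gcc-toolset-11" ;;  # rhel8's system gcc is too old for this build
    9) echo "" ;;                # rhel9 already ships gcc 11.x as the system compiler
    *) echo "unsupported" ;;
  esac
}

pick_gcc_toolset "8.9"   # -> gcc-toolset-11
pick_gcc_toolset "9.3"   # -> (empty: use system gcc)
```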


cazlo commented Oct 18, 2024

the gcc pin to 10.2 looks like it could go away if cuda gets a version bump.

currently ollama is using cuda 11.3.1, whose latest supported gcc appears to be 9.x (see https://docs.nvidia.com/cuda/archive/11.3.1/cuda-installation-guide-linux/index.html)

looking at cuda 11.7.1 it supports gcc 11.x, with rhel9 officially supported (see https://docs.nvidia.com/cuda/archive/11.7.1/cuda-installation-guide-linux/index.html)

looking at upstream llama.cpp Dockerfiles, there is evidence it was using cuda 11.7.1 (successfully?) until around 2024-09 (see ggerganov/llama.cpp@66b039a)

so I bet if we bump to cuda 11.7.1, we can migrate the builds to rocky8 without too much hassle and remove all the gcc pinning hacks

performance tests show no significant difference, indicating it works as well as it did before the changes

cazlo commented Oct 18, 2024

based on the git blame for the gcc 10.2 pin, we should probably at least do an arm64 build and make sure it succeeds before opening a PR against upstream.

I don't have HW available to test the arm build and would prefer not to rent it, so upstream maintainers will have to do runtime checks

see also dhiltgen@5dacc1e
ollama@b8c2be6


cazlo commented Oct 18, 2024

upstream ollama/ollama just recently made a change to 'vendor in' the upstream llama.cpp code instead of pulling it in via submodule

before the PR to upstream, make sure to merge this change into the unit under test

edit later: done


cazlo commented Oct 19, 2024

the arm64 build is really slow on an amd64 box using qemu for arm emulation: > 1 hr to get only 48% through the build (with the CPU pegged at 100% the whole time).

if we trigger the runners CI step (e.g. through changes to the llama dir), it seems like CI will run this on actual arm runners in much less time
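For reference, the emulated cross-build above would be kicked off with something like this (the tag name is a made-up example, not from this PR):

```shell
# Illustrative cross-build invocation. On an amd64 host, buildx runs the
# linux/arm64 stages under qemu user emulation, which is what made the
# build take >1 hr; on a native arm runner the same command is much faster.
docker buildx build --platform linux/arm64 -t ollama:arm64-test .
```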

about the gcc 10.3 + nvcc issue, this thread seems to be valuable:

see also https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5357ab75dedef403b0eebf9277d61d1cbeb5898f

seems like the bug is a segfault in gcc, something I haven't been able to reproduce so far on any builds

I can't find any evidence RH or others backported this to gcc 10.3.1 (the gcc version installed by gcc-toolset-10) on rhel8.

So it seems best to bump to gcc 11.2+ given:

  • it has the bugfix in it
  • it is listed as supported by cuda and rocm docs
  • rhel8 gives us gcc 11.2.1 if we ask for gcc-toolset-11-gcc
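A sketch of what that last point would look like in a Rocky/RHEL 8 build layer (standard gcc-toolset-11 package and path names; only runnable inside such an image):

```shell
# Assumes a Rocky/RHEL 8 base image; not runnable elsewhere.
dnf install -y gcc-toolset-11-gcc gcc-toolset-11-gcc-c++
source /opt/rh/gcc-toolset-11/enable   # puts gcc 11.2.1 on PATH for this shell
gcc --version
```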


cazlo commented Oct 19, 2024

finding some evidence the nvcc compile bug was fixed in upstream cuda 11.6.0:
https://docs.nvidia.com/cuda/archive/11.6.0/cuda-toolkit-release-notes/index.html

An issue with the use of lambda function when an object is passed-by-value is resolved. https://github.com/Ahdhn/nvcc_bug_maybe

the symptom presented at https://github.com/Ahdhn/nvcc_bug_maybe is similar to the segfault alpaka is talking about
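A quick shell sanity check that the 11.7.1 bump discussed earlier already contains the 11.6.0 fix, using `sort -V` for version ordering:

```shell
# If the version the fix landed in sorts first, the chosen version has it.
fixed_in="11.6.0"
chosen="11.7.1"
oldest="$(printf '%s\n' "$fixed_in" "$chosen" | sort -V | head -n1)"
[ "$oldest" = "$fixed_in" ] && echo "11.7.1 includes the 11.6.0 fix"
# -> prints "11.7.1 includes the 11.6.0 fix"
```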

Dockerfile
initial smokechecks showing successful rocm amd64 compile + equivalent performance in smoke tests
```diff
 ARG GOLANG_VERSION=1.22.5
 ARG CMAKE_VERSION=3.22.1
-ARG CUDA_VERSION_11=11.3.1
+ARG CUDA_VERSION_11=11.7.1
```

@cazlo cazlo Oct 19, 2024


11.7.1 was chosen for consistency with upstream llama.cpp (see also ggerganov/llama.cpp@66b039a). 11.8.0 is actually the latest available in the cuda 11 major version.

looking at https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html, not seeing anything hugely compelling outside of rocky 9 support:

11.8
This release introduces support for both the Hopper and Ada Lovelace GPU families.
Added support for Rocky Linux 9.
Added support for Kylin OS.
Package upgradable CUDA is now available starting CUDA 11.8 for Jetson devices. Refer to https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#upgradable-package-for-jetson for details on how to upgrade to the latest CUDA version on Jetson and the supported JetPack versions.


cazlo commented Oct 19, 2024

supplanted by ollama#7265

@cazlo cazlo closed this Oct 19, 2024
Successfully merging this pull request may close these issues.

Migrate off centos 7 for intermediate build layers in container image builds