
Add CUDA support #172

Closed · wants to merge 50 commits

Conversation

@huebner-m (Contributor) commented May 12, 2022

This WIP PR follows up on the work done in the Hackathons.
The following features have been implemented:

  • Check that NVIDIA drivers are installed on host (via nvidia-smi)
  • Get latest CUDA compat libs from NVIDIA website (automatically picking the correct OS version etc.)
  • Check that host_injections is a writable path
  • Install CUDA compat libs in host_injections (using the appropriate rpm or deb files and tools)
  • Check if CUDA is already installed as a module
  • Check if disk space in host_injections is sufficient to install CUDA
  • Install CUDA using EasyBuild
  • Add test based on CUDA samples to check if GPU support works
  • Add EasyBuild hook to tag software that depends on CUDA with the property gpu (based on https://github.com/easybuilders/JSC/blob/2022/Custom_Hooks/eb_hooks.py#L335)
  • Add Lmod plugin that hides software with the property gpu if CUDA is not installed

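The driver check in the first bullet could be sketched roughly as follows (a minimal illustration, not the PR's actual code; `check_driver_tool` is a hypothetical helper name):

```shell
# Hypothetical sketch of the nvidia-smi driver check; not the PR's exact code.
check_driver_tool() {
  # $1: tool expected on the host (nvidia-smi in the real case)
  if command -v "$1" > /dev/null 2>&1; then
    return 0
  else
    echo "$1 not found; NVIDIA drivers do not seem to be installed on the host" >&2
    return 1
  fi
}
```

In the real script this would be called as `check_driver_tool nvidia-smi`, followed by parsing the tool's output for the driver's supported CUDA version.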
Open tasks:

  • Only download/install CUDA compat libs from NVIDIA website if necessary (wip, see: 2cc5ce9)
  • Update Lmod version to make plugin work, also see: Use properties in isVisible hook TACC/Lmod#552
  • Ship CUDA samples with EESSI
  • Ship p7zip with software layer
  • Tested with a Debian 10 container; other OSs still need testing

CUDA itself is not shipped by EESSI and has to be installed on
the host. The scripts perform various checks and download and
install the CUDA compat libs.
Modules with CUDA as a dependency are hidden in Lmod, unless
the CUDA compat libs are installed which is only done when CUDA
itself is installed on the host. This aspect still has to be
tested with an updated Lmod version in the EESSI compat layer.
Member

@ocaisa ocaisa left a comment

Encouraging!

eb_hooks.py (outdated, resolved)
or ec_dict["toolchain"]["name"] in CUDA_ENABLED_TOOLCHAINS
):
key = "modluafooter"
value = 'add_property("arch","gpu")'
Member

I think gpu is a recognised property in Lmod so a good choice for now. Once we add AMD support it will get more complicated.

Contributor Author

We can add a new property by extending the property table propT. To do so, we could add a file init/lmodrc.lua with a new property. This file can be loaded using the env var $LMOD_RC. Unfortunately, we do not seem to be able to add entries to arch but rather have to add a new property (or find a way to extend arch that I'm missing).
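That idea could be sketched as follows: a script fragment that writes an `init/lmodrc.lua` with an extended property table and points `$LMOD_RC` at it. The exact `propT` field values below are illustrative, not the final ones (the target directory is also just an example):

```shell
# Sketch: ship init/lmodrc.lua with a custom "gpu" property and point
# $LMOD_RC at it. Field values here are illustrative, not the final ones.
lmodrc_dir="${PWD}/init"
mkdir -p "${lmodrc_dir}"
cat > "${lmodrc_dir}/lmodrc.lua" <<'EOF'
propT = {
   gpu = {
      validT = { nvidia = 1 },
      displayT = {
         nvidia = { short = "(gpu)", long = "(gpu)",
                    doc = "Requires CUDA compat libraries in host_injections" },
      },
   },
}
EOF
export LMOD_RC="${lmodrc_dir}/lmodrc.lua"
```

Lmod reads the file named by `$LMOD_RC` at startup, so any property declared this way becomes usable in `add_property` calls in module files.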

# TODO: needs more thorough testing
os_family=$(uname | tr '[:upper:]' '[:lower:]')

# Get OS version
Member

@boegel Does EB do this already, can we hook into that?
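For context on that detection step, deriving an OS identifier from os-release(5) data might look like this (a sketch only; the function name and the `debian10`-style output format are assumptions, not the PR's code):

```shell
get_os_id() {
  # $1: path to an /etc/os-release style file
  # prints e.g. "debian10" or "rhel8" based on the ID and VERSION_ID fields
  local id version_id
  id=$(. "$1" && echo "${ID}")
  version_id=$(. "$1" && echo "${VERSION_ID%%.*}")
  echo "${id}${version_id}"
}
```

On a live system this would be called as `get_os_id /etc/os-release`.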

gpu_support/add_gpu_support.sh (outdated, resolved)
gpu_support/add_gpu_support.sh (outdated, resolved)
gpu_support/test_cuda (outdated, resolved)
gpu_support/test_cuda (outdated, resolved)
module use /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/modules/all/
module load CUDA
tmp_dir=$(mktemp -d)
cp -r $EBROOTCUDA/samples $tmp_dir
Member

CUDA no longer ships samples, there is a plan to ship the (compiled) CUDA samples with EESSI

gpu_support/test_cuda (outdated, resolved)
init/SitePackage.lua (outdated, resolved)
@ocaisa (Member) commented May 13, 2022

There are indications that we may be allowed to ship the CUDA runtime with EESSI, that would mean we wouldn't (necessarily) need to install CUDA unless people want to actually build their own software on top of EESSI.

I would go with this PR as is right now, but in a future pilot we should make that installation optional (making another code branch that only creates the software installation directory so that the Lmod plugin will still work). In that scenario though, we will need some major tweaking to the CUDA module shipped with EESSI, it would need conditionals based on what is available on the host. We'd also probably want to shadow nvcc and friends with a nice echo script that explains what is needed to get a fully functional installation.

Member

@ocaisa ocaisa left a comment

I think it's time to remove the WIP label and get someone else's feedback on this!

gpu_support/add_nvidia_gpu_support.sh (resolved)
gpu_support/add_amd_gpu_support.sh (resolved)
@huebner-m huebner-m changed the title from "Add CUDA support [WIP]" to "Add CUDA support" on May 19, 2022
Member

@ocaisa ocaisa left a comment

Allow using an environment variable to skip GPU checks

gpu_support/add_nvidia_gpu_support.sh (resolved)
gpu_support/add_nvidia_gpu_support.sh (resolved)
eb_hooks.py (outdated, resolved)
gpu_support/test_cuda (outdated, resolved)
module load EasyBuild
# we need the --rebuild option, since the module file is shipped with EESSI
tmpdir=$(mktemp -d)
eb --rebuild --installpath-modules=${tmpdir} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
Member

Right now this makes testing difficult as the actual CUDA module in EESSI is not available. You might be better off here checking if the CUDA module exists and if so prefixing this command with EASYBUILD_INSTALLPATH_MODULES=${tmpdir}

# The rpm and deb files contain the same libraries, so we just stick to the rpm version.
# If p7zip is missing from the software layer (for whatever reason), we need to install it.
# This has to happen in host_injections, so we check first if it is already installed there.
module use ${cuda_install_dir}/modules/all/
Member

Maybe do this conditionally (i.e., only if this directory exists)

Comment on lines 79 to 81
# we need the --rebuild option, since the module file is shipped with EESSI
tmpdir=$(mktemp -d)
eb --rebuild --installpath-modules=${tmpdir} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
Member

Suggested change:
-# we need the --rebuild option, since the module file is shipped with EESSI
-tmpdir=$(mktemp -d)
-eb --rebuild --installpath-modules=${tmpdir} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
+# we need the --rebuild option and a random dir for the module if the module file is shipped with EESSI
+if [ -f ${EESSI_SOFTWARE_PATH}/modules/all/CUDA/${install_cuda_version}.lua ]; then
+  tmpdir=$(mktemp -d)
+  extra_args="--rebuild --installpath-modules=${tmpdir}"
+fi
+eb ${extra_args} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb

gpu_support/test_cuda.sh (outdated, resolved)
echo "Cannot test CUDA, modules path does not exist, exiting now..."
exit 1
fi
module load CUDA
Member

We should probably load the specific version of CUDA here.

@ocaisa (Member) commented Sep 13, 2022

For me to get this working out of the box right now I needed

diff --git a/gpu_support/add_nvidia_gpu_support.sh b/gpu_support/add_nvidia_gpu_support.sh
index 025a1be..9cf4b70 100755
--- a/gpu_support/add_nvidia_gpu_support.sh
+++ b/gpu_support/add_nvidia_gpu_support.sh
@@ -76,9 +76,12 @@ else
   fi
   # install cuda in host_injections
   module load EasyBuild
-  # we need the --rebuild option, since the module file is shipped with EESSI
-  tmpdir=$(mktemp -d)
-  eb --rebuild --installpath-modules=${tmpdir} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
+  # we need the --rebuild option and a random dir for the module if the module file is shipped with EESSI
+  if [ -f ${EESSI_SOFTWARE_PATH}/modules/all/CUDA/${install_cuda_version}.lua ]; then
+    tmpdir=$(mktemp -d)
+    extra_args="--rebuild --installpath-modules=${tmpdir}"
+  fi
+  eb ${extra_args} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
   ret=$?
   if [ $ret -ne 0 ]; then
     echo "CUDA installation failed, please check EasyBuild logs..."
@@ -97,6 +100,12 @@ if [[ $? -eq 0 ]]; then
   echo "p7zip module found! No need to install p7zip again, proceeding with installation of compat libraries"
 else
   # install p7zip in host_injections
+  export EASYBUILD_IGNORE_OSDEPS=1
+  export EASYBUILD_SYSROOT=${EPREFIX}
+  export EASYBUILD_RPATH=1
+  export EASYBUILD_FILTER_ENV_VARS=LD_LIBRARY_PATH
+  export EASYBUILD_FILTER_DEPS=Autoconf,Automake,Autotools,binutils,bzip2,cURL,DBus,flex,gettext,gperf,help2man,intltool,libreadline,libtool,Lua,M4,makeinfo,ncurses,util-linux,XZ,zlib
+  export EASYBUILD_MODULE_EXTENSIONS=1
   module load EasyBuild
   eb --robot --installpath=${cuda_install_dir}/ p7zip-${install_p7zip_version}.eb
   ret=$?

@ocaisa (Member) commented Sep 15, 2022

The discussion today at the EESSI Community Meeting led to the following design suggestions/comments:

  • According to the CUDA EULA we can distribute the CUDA runtime. It looks like this maps more or less to the lib directory under the CUDA installation; even if the mapping is not exact, the EULA essentially gives a whitelist.
  • We can create a post-install hook for the CUDA installation that replaces anything not in the whitelist with a symlink to the equivalent location under host_injections. When the host_injections path has a CUDA install, CUDA is fully capable; otherwise it is just the runtime (i.e., we only ship the runtime).
  • We use an EasyBuild hook so that CUDA will be downgraded to a build dependency (or simply excluded from the final module file). Our use of rpath means software built with CUDA will still work (since it will link to the runtime, which we will ship).
  • Lmod will need to be smart on two fronts. It will need a hook that looks at the Lmod gpu property and checks for the compat library before allowing the module to load (with a helpful error if it needs to fail). For the CUDA module itself, it will also need to refuse to load unless the symlinks inside have been resolved.
  • For the verification that CUDA is actually working, we will need to ship a compiled deviceQuery (or whatever), since there is no guarantee the nvcc compiler will be available.

The benefit of this approach is that we only install the CUDA SDK if the user actually wants it. It will greatly speed up the GPU support script since there will be no need for any eb installations at all (by default).
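The symlink idea in the second bullet could be sketched like this (a sketch under the stated design only; `link_non_whitelisted` and its argument layout are hypothetical, and the real hook would operate on the full CUDA directory tree):

```shell
link_non_whitelisted() {
  # $1: CUDA install dir in the software layer
  # $2: matching dir under host_injections
  # remaining args: file names we are allowed to redistribute
  local install_dir=$1 host_injections_dir=$2
  shift 2
  local f name w keep
  for f in "${install_dir}"/*; do
    name=$(basename "${f}")
    keep=no
    for w in "$@"; do
      if [ "${name}" = "${w}" ]; then keep=yes; fi
    done
    if [ "${keep}" = no ]; then
      # not whitelisted: replace with a symlink into host_injections;
      # the link stays broken until the user installs CUDA there
      rm -rf "${f}"
      ln -s "${host_injections_dir}/${name}" "${f}"
    fi
  done
}
```

The broken-until-installed symlinks are what the Lmod side can later test for before allowing the CUDA module to load.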

Whitelisted CUDA libraries can now be shipped with EESSI. The other
libraries and files are replaced with symlinks to host_injections.
A compiled CUDA sample can now also be shipped with EESSI. This is
relevant if users only need the runtime capabilities and not the
whole CUDA suite (which would include the compilers). It is now
possible to solely install the compat libs as a user and get access
to the runtime environment this way. It is still possible to also
install the whole CUDA suite.
CUDA enabled modules with the gpu property now only load if the
compat libs are installed in host_injections.
The CUDA versions needed by modules are now written as envvars
that will be exported into the module files. The CUDA version for
which we have the current compat libs installed is saved in a txt
file in ../host_injections/nvidia/latest/version.txt
The lmod hook called when loading modules with the gpu property
now compares these two versions and exits out if the installed
version needs to be updated.
The fix for removing the temporary test dir is needed when cloning
the samples from github, i.e. for CUDA > 11.6.0. Otherwise, the
script call from the eb hook will get stuck.
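The version comparison described above can be illustrated in shell with `sort -V` (a sketch only; the actual hook runs inside Lmod, and `cuda_version_ok` is a hypothetical name):

```shell
cuda_version_ok() {
  # $1: CUDA version a module needs (from its exported envvar)
  # $2: version recorded in host_injections/nvidia/latest/version.txt
  # returns 0 if the installed compat libs are new enough
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

If this check fails, the hook exits with a message telling the user to update the installed compat libraries.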
Comment on lines +59 to +65
###############################################################################################
# Install CUDA
cuda_install_dir="${EESSI_SOFTWARE_PATH/versions/host_injections}"
mkdir -p ${cuda_install_dir}
if [ "${install_cuda}" != false ]; then
bash $(dirname "$BASH_SOURCE")/cuda_utils/install_cuda.sh ${install_cuda_version} ${cuda_install_dir}
fi
Member

Let's break this into a separate script (and PR) since it will be needed by #212

You also need to check the exit code on the creation of cuda_install_dir since this may fail

bash $(dirname "$BASH_SOURCE")/cuda_utils/install_cuda.sh ${install_cuda_version} ${cuda_install_dir}
fi
###############################################################################################
# Prepare installation of CUDA compat libraries, i.e. install p7zip if it is missing
Member

You can drop stuff related to p7zip because of #212 (and that also means we can drop prepare_cuda_compatlibs.sh entirely)

# Otherwise, give up
bash $(dirname "$BASH_SOURCE")/cuda_utils/install_cuda_compatlibs_loop.sh ${cuda_install_dir} ${install_cuda_version}

cuda_version_file="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/version.txt"
Member

Suggested change:
-cuda_version_file="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/version.txt"
+cuda_version_file="/cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/eessi_compat_version.txt"

I also think that this creation should be part of install_cuda_compatlibs_loop.sh and we should put the supported CUDA version in there according to the compat libs, not the version we need (will help us to avoid unnecessary updates in the future).

install_cuda_version=$1
cuda_install_dir=$2

# TODO: Can we do a trimmed install?
Member

This is done now via your hook

#!/bin/bash

install_cuda_version=$1
cuda_install_dir=$2
Member

General CUDA installation is done via #212 now so I don't think you need this argument. This script is only about installing the CUDA package under host_injections...but changing the name of the script to reflect that is probably a good idea.

# This is only relevant for users, the shipped CUDA installation will
# always be in versions instead of host_injections and have symlinks pointing
# to host_injections for everything we're not allowed to ship
if [ -f ${cuda_install_dir}/software/CUDA/${install_cuda_version}/EULA.txt ]; then
Member

The if/else is still good, except we should be checking under the host_injections path. This will allow us to skip any check on available space etc.

Member

You should construct cuda_install_dir rather than take it as an argument

Member

Also, we should allow for a forced installation to override this check

Member

@ocaisa ocaisa Dec 18, 2022

I'd prefer that we ship the EULA text, so I think we should check for an expected broken symlink here:

if [[ -L "${cuda_install_dir}/software/CUDA/bin/nvcc" && ! -e "${cuda_install_dir}/software/CUDA/bin/nvcc" ]]; then

Comment on lines +17 to +20
avail_space=$(df --output=avail ${cuda_install_dir}/ | tail -n 1 | awk '{print $1}')
if (( ${avail_space} < 16000000 )); then
echo "Need more disk space to install CUDA, exiting now..."
exit 1
Member

This is a tricky one: we need space for sources, space for the build, and space for the install, but people can choose where to put all of these. I guess we leave it as is for now, but allow people to set an envvar to override this check (and tell them that envvar in the error message).
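An overridable version of the check could look like this (a sketch; `EESSI_OVERRIDE_GPU_CHECK` and `check_disk_space` are hypothetical names, and 16000000 KiB matches the threshold in the excerpt above):

```shell
check_disk_space() {
  # $1: target dir, $2: required free space in KiB
  # EESSI_OVERRIDE_GPU_CHECK is a hypothetical override variable
  if [ -n "${EESSI_OVERRIDE_GPU_CHECK}" ]; then
    return 0
  fi
  local avail
  avail=$(df --output=avail "$1" | tail -n 1)
  if [ "${avail}" -lt "$2" ]; then
    echo "Need more disk space in $1 to install CUDA;" \
         "set EESSI_OVERRIDE_GPU_CHECK=1 to skip this check" >&2
    return 1
  fi
}
```

In the script this would replace the bare `df`/`exit 1` sequence, e.g. `check_disk_space ${cuda_install_dir} 16000000 || exit 1`.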

# install cuda in host_injections
module load EasyBuild
# we need the --rebuild option and a random dir for the module if the module file is shipped with EESSI
if [ -f ${EESSI_SOFTWARE_PATH}/modules/all/CUDA/${install_cuda_version}.lua ]; then
Member

If this script is standalone, we'll need to guarantee EESSI_SOFTWARE_PATH is defined.

extra_args="--rebuild --installpath-modules=${tmpdir}"
fi
eb ${extra_args} --installpath=${cuda_install_dir}/ CUDA-${install_cuda_version}.eb
ret=$?
Member

Let's import the bash functions defined in utils.sh and use them throughout (where appropriate)

@ocaisa (Member) commented Dec 16, 2022

@huebner-m Let's branch out a separate PR for the script to install CUDA under host_injections

if [ -w /cvmfs/pilot.eessi-hpc.org/host_injections ]; then
mkdir -p ${host_injections_dir}
else
echo "Cannot write to eessi host_injections space, exiting now..." >&2
Member

Let's start using utils.sh here

fi
cd ${host_injections_dir}

# Check if our target CUDA is satisfied by what is installed already
Member

Do we know what our target CUDA version is at this point? And if the nvidia-smi result is good enough, what then? I guess we should check that this version comes from an installation of the compat libs, otherwise we still need to install them.

Member

If the supported CUDA version is new enough and comes from an EESSI installation of the CUDA compat libs, we can already exit gracefully.

Member

We should leverage the contents of /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/eessi_compat_version.txt here

@@ -0,0 +1,92 @@
#!/bin/bash
Member

Shouldn't this be a compat layer bash?

Member

It's fine as is, as long as the first thing we do is source the EESSI environment

Comment on lines +40 to +43
# p7zip is installed under host_injections for now, make that known to the environment
if [ -d ${cuda_install_dir}/modules/all ]; then
module use ${cuda_install_dir}/modules/all/
fi
Member

You can drop this

fi

# Create the space to host the libraries
mkdir -p ${host_injection_linker_dir}
Member

@ocaisa ocaisa Dec 16, 2022

We should always check exit codes on our commands; it seems like a function that does that for us is needed.
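Such a helper could be as simple as this (a sketch; the name `check_exit_code` and its message format are assumptions, not the contents of utils.sh):

```shell
# Sketch of a utils.sh-style helper; the name check_exit_code is an assumption
check_exit_code() {
  # $1: exit code to inspect, $2: message to print on failure
  if [ "$1" -ne 0 ]; then
    echo "$2" >&2
    exit "$1"
  fi
}
# intended usage:
#   mkdir -p ${host_injection_linker_dir}
#   check_exit_code $? "Failed to create ${host_injection_linker_dir}"
```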

Comment on lines +76 to +81
if [ -d ${cuda_install_dir}/modules/all ]; then
module use ${cuda_install_dir}/modules/all/
else
echo "Cannot load CUDA, modules path does not exist, exiting now..."
exit 1
fi
Member

Drop this, no need for the module use, CUDA-Samples is shipped with EESSI. Our Lmod hook should cause the load of (a specific version of) CUDA-Samples (not CUDA since we only deal with compat libs here) to fail unless the compat libs are in place (i.e. Lmod should ensure the existence of /cvmfs/pilot.eessi-hpc.org/host_injections/nvidia/latest/eessi_compat_version.txt)

exit 1
else
echo "Successfully loaded CUDA, you are good to go! :)"
echo " - To build CUDA enabled modules use ${EESSI_SOFTWARE_PATH/versions/host_injections} as your EasyBuild prefix"
Member

This is not required; they can build where they like (but it is a very sensible location!)

@@ -0,0 +1,31 @@
#!/bin/bash
Member

This script can be dropped

@@ -0,0 +1,82 @@
#!/bin/bash
Member

This script can be greatly simplified: just load CUDA-Samples and see whether deviceQuery runs.
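That simplified test could be sketched like this (a sketch; `run_gpu_check` is a hypothetical wrapper, and the `CUDA-Samples` module name comes from the comments above):

```shell
# Sketch of the simplified test: run deviceQuery from the shipped
# CUDA-Samples and report the result (run_gpu_check is a hypothetical name)
run_gpu_check() {
  # $@: command to run; deviceQuery in the real case
  if "$@" > /dev/null 2>&1; then
    echo "GPU check passed"
  else
    echo "GPU check failed, see the command output for details" >&2
    return 1
  fi
}
# intended usage:
#   module load CUDA-Samples
#   run_gpu_check deviceQuery
```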

Member

We can expand the testing from there.

@ocaisa (Member) commented Dec 16, 2022

@huebner-m Break the two compat libs scripts off into another PR, the testing script into another, the lmod hook into another and the docs into another (we can finalise those once the others are ready)

@ocaisa (Member) commented Dec 21, 2023

GPU support implemented with #434

@ocaisa ocaisa closed this Dec 21, 2023