../tools/scripts/run_cuda_tests Fails on WSL2 #1533

Open
CLRafaelR opened this issue Aug 13, 2024 · 3 comments
CLRafaelR commented Aug 13, 2024

I am running Ubuntu 24.04 LTS on Windows Subsystem for Linux 2 (WSL2) and trying to perform programming tasks using an NVIDIA GeForce RTX 3060 GPU via OpenCL. Initially, running clinfo -l did not display any platforms or devices due to the issue described in No OpenCL platforms reported · Issue #6951 · microsoft/WSL. However, following the solution provided in this comment, I installed PoCL, and now clinfo and clinfo -l recognise my GPU.

Questions

Despite successfully building and installing PoCL, and having clinfo and clinfo -l functioning correctly, I encounter four failed tests when running the PoCL verification script ../tools/scripts/run_cuda_tests for NVIDIA GPUs, as shown below and as described in NVIDIA GPU support — Portable Computing Language (PoCL) 6.0 documentation:

cd ~/pocl-6.0/build # move to my `build` directory
../tools/scripts/run_cuda_tests

# For rerunning the failed tests:
../tools/scripts/run_cuda_tests --rerun-failed --output-on-failure

Failed tests were:

The following tests FAILED:
          4 - kernel/test_as_type_loopvec (Failed)
        166 - regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec (Failed)
        208 - runtime/test_device_address (SEGFAULT)
        209 - runtime/test_svm (SEGFAULT)
Errors while running CTest

Regarding these four failed tests, I have two questions:

  1. What do these tests assess?
  2. What can I do to ensure these tests pass? Any guidance on checking path settings or installing additional packages would be greatly appreciated.

Error log

Start testing: Aug 13 21:29 JST
----------------------------------------------------------
4/264 Testing: kernel/test_as_type_loopvec
4/264 Test: kernel/test_as_type_loopvec
Command: "/usr/bin/cmake" "-Dtest_cmd=/home/a6m1/pocl-6.0/build/tests/kernel/kernel####test_as_type" "-P" "/home/a6m1/pocl-6.0/cmake/run_test.cmake"
Directory: /home/a6m1/pocl-6.0/build/tests/kernel
"kernel/test_as_type_loopvec" start time: Aug 13 21:29 JST
Output:
----------------------------------------------------------
Running test test_as_type...
FAIL: as_char3((char4)) - byte #: <format error> expected: <format error>.2x actual: <format error>.2x
OK


<end of output>
Test time =   5.30 sec
----------------------------------------------------------
Test Fail Reason:
Error regular expression found in output. Regex=[FAIL]
"kernel/test_as_type_loopvec" end time: Aug 13 21:29 JST
"kernel/test_as_type_loopvec" time elapsed: 00:00:05
----------------------------------------------------------

208/264 Testing: runtime/test_device_address
208/264 Test: runtime/test_device_address
Command: "/home/a6m1/pocl-6.0/build/tests/runtime/test_device_address"
Directory: /home/a6m1/pocl-6.0/build/tests/runtime
"runtime/test_device_address" start time: Aug 13 21:29 JST
Output:
----------------------------------------------------------
NVIDIA GeForce RTX 3060 OpenCL 3.0 PoCL HSTR: CUDA-sm_75: suitable
<end of output>
Test time =   0.98 sec
----------------------------------------------------------
Test Failed.
"runtime/test_device_address" end time: Aug 13 21:29 JST
"runtime/test_device_address" time elapsed: 00:00:00
----------------------------------------------------------

209/264 Testing: runtime/test_svm
209/264 Test: runtime/test_svm
Command: "/home/a6m1/pocl-6.0/build/tests/runtime/test_svm"
Directory: /home/a6m1/pocl-6.0/build/tests/runtime
"runtime/test_svm" start time: Aug 13 21:29 JST
Output:
----------------------------------------------------------
<end of output>
Test time =   0.97 sec
----------------------------------------------------------
Test Failed.
"runtime/test_svm" end time: Aug 13 21:29 JST
"runtime/test_svm" time elapsed: 00:00:00
----------------------------------------------------------

166/264 Testing: regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec
166/264 Test: regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec
Command: "/usr/bin/cmake" "-Dtest_cmd=/home/a6m1/pocl-6.0/build/tests/regression/test_setargs" "-P" "/home/a6m1/pocl-6.0/cmake/run_test.cmake"
Directory: /home/a6m1/pocl-6.0/build/tests/regression
"regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec" start time: Aug 13 21:29 JST
Output:
----------------------------------------------------------
CMake Error at /home/a6m1/pocl-6.0/cmake/run_test.cmake:34 (message):
  FAIL: Test exited with nonzero code (1):
  /home/a6m1/pocl-6.0/build/tests/regression/test_setargs

  STDOUT:

  FAIL



  STDERR:

  3876879362



<end of output>
Test time =   1.05 sec
----------------------------------------------------------
Test Fail Reason:
Error regular expression found in output. Regex=[FAIL]
"regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec" end time: Aug 13 21:30 JST
"regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec" time elapsed: 00:00:01
----------------------------------------------------------

End testing: Aug 13 21:30 JST

cuda =   8.31 sec*proc

hsa-native =   5.30 sec*proc

internal =   8.31 sec*proc

kernel =   5.30 sec*proc

level0 =   8.31 sec*proc

regression =   1.05 sec*proc

runtime =   1.95 sec*proc

vulkan =   1.05 sec*proc

Steps I took to set up the NVIDIA Driver and CUDA Toolkit

  1. Installed the NVIDIA Driver from NVIDIA's website.
  2. Verified that libcuda.so exists only in /usr/lib/wsl/lib/libcuda.so using find /usr/ -name libcuda.so.
  3. Followed the CUDA on WSL guide:
    1. Removed the existing key: sudo apt-key del 7fa2af80.
    2. Executed the following commands as per the CUDA 12.1.1 installation guide:
      wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
      sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
      wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
      sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
      sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
      sudo apt-get update
      sudo apt-get -y install cuda
  4. Set the PATH:
    echo 'export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}'  >> ~/.bashrc
    
    source ~/.bashrc
  5. Prevented WSL from inheriting the Windows PATH:
    1. In Ubuntu 24.04, execute sudo nano /etc/wsl.conf and add the following to wsl.conf:
      # Prevent inheriting Windows PATH
      [interop]
      appendWindowsPath = false
    2. In Windows PowerShell, execute:
      wsl.exe --shutdown
  6. Verified that libcuda.so exists in both /usr/lib/wsl/lib/libcuda.so and /usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so using find /usr/ -name libcuda.so.
  7. Installed essential build tools:
    sudo apt -y install build-essential gcc g++ make libtool texinfo dpkg-dev pkg-config gfortran
  8. Installed OpenBLAS following the OpenBLAS Wiki:
    sudo apt update
    apt search openblas
    sudo apt install libopenblas-dev

PoCL Installation

I ultimately installed PoCL using the following method. I repeated the installation several times with different settings, making sure to run xargs rm < install_manifest.txt in the pocl-6.0/build directory and to delete the pocl-6.0/build directory before each reinstallation.

  1. Executed the following commands to install PoCL as per the official PoCL installation guide:
    export LLVM_VERSION=18
    apt install -y python3-dev libpython3-dev build-essential ocl-icd-libopencl1 \
        cmake git pkg-config libclang-${LLVM_VERSION}-dev clang-${LLVM_VERSION} \
        llvm-${LLVM_VERSION} make ninja-build ocl-icd-libopencl1 ocl-icd-dev \
        ocl-icd-opencl-dev libhwloc-dev zlib1g zlib1g-dev clinfo dialog apt-utils \
        libxml2-dev libclang-cpp${LLVM_VERSION}-dev libclang-cpp${LLVM_VERSION} \
        llvm-${LLVM_VERSION}-dev
  2. Downloaded PoCL:
    wget https://github.com/pocl/pocl/archive/refs/tags/v6.0.tar.gz
  3. Extracted the tarball:
    tar -xzvf v6.0.tar.gz
  4. Changed to the PoCL directory:
    cd pocl-6.0
  5. Created a build directory:
    mkdir build
  6. Built PoCL following the instructions from the GitHub issue comment:
    # ENABLE_HOST_CPU_DEVICES=OFF: having both CPU and GPU devices enabled
    #   simultaneously can cause issues: https://github.com/pocl/pocl/issues/853#issuecomment-696367623
    # WITH_LLVM_CONFIG: https://forums.developer.nvidia.com/t/need-support-to-run-opencl-application-on-tx2-board/264420/4
    # ENABLE_EXAMPLES=ON installs the CUDA tests: NVIDIA GPU support — Portable
    #   Computing Language (PoCL) 6.0 documentation https://portablecl.org/docs/html/cuda.html#run-tests
    # (Note: comments must not follow the `\` line continuations, or the command breaks.)
    cmake -B build \
        -DCMAKE_C_FLAGS=-L/usr/lib/wsl/lib \
        -DCMAKE_CXX_FLAGS=-L/usr/lib/wsl/lib \
        -DENABLE_HOST_CPU_DEVICES=OFF \
        -DENABLE_CUDA=ON \
        -DWITH_LLVM_CONFIG=/usr/bin/llvm-config-${LLVM_VERSION} \
        -DENABLE_EXAMPLES=ON
  7. Compiled PoCL:
    cmake --build build -j34
  8. Added environment variables to .bashrc:
    echo 'export POCL_BUILDING=1' >> ~/.bashrc
    echo 'export OCL_ICD_VENDORS=<FULL_PATH_OF_MY_HOME_DIR>/pocl-6.0/build/ocl-vendors/' >> ~/.bashrc
    
    sudo nano /etc/OpenCL/vendors/nvidia.icd
    # The default entry is `/libnvidia-opencl.so`, but there is no such file at that path.
    # Therefore, replace the default line with the full library path (just the bare path;
    # an .icd file contains only the name of the vendor library, no `export` keyword):
    # /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.90.07
    
    source ~/.bashrc
  9. Installed PoCL:
    cmake --install build
  10. Verified GPU recognition with clinfo --list:
    $ clinfo --list
    
    Platform #0: Portable Computing Language
        `-- Device #0: NVIDIA GeForce RTX 3060
  11. Ran PoCL tests:
    cd ~/pocl-6.0/build
    ../tools/scripts/run_cuda_tests
    
    # For rerunning the failed tests:
    ../tools/scripts/run_cuda_tests --rerun-failed --output-on-failure
@pjaaskel pjaaskel added the CUDA label Aug 13, 2024
@pjaaskel (Member)
Thanks for your extensive report. CUDA is still an experimental driver in a rather early state, without very active progress, as you can see. @isuruf do you have insights? Is this WSL2-specific, that is, do those tests pass on "bare-metal" Linux?

@CLRafaelR (Author)
@pjaaskel

Thank you very much for your assistance.

I have also left a comment on the aforementioned issue, asking whether others who attempted PoCL installation on WSL2 using the same method experienced success with ../tools/scripts/run_cuda_tests. I hope this will help determine if the issue is unique to me or specific to WSL2...

@pjaaskel (Member)
Note that we do have a CUDA CI bot: https://github.com/pocl/pocl/actions/runs/10354582711/job/28664704111 and it has passed the simple smoke test suite, but the card we use in that box is quite old.
