Feat: Automatic GPU Switch #845

Steel-skull · 2024-10-30T13:32:52Z

Docker Windows GPU Passthrough

[This is not fully tested, as I'm still waiting for a GPU to arrive.]

Automated GPU management solution for Windows in Docker containers with NVIDIA GPU passthrough support. This project provides scripts and configurations to dynamically manage GPU binding between host and Docker containers, with support for multiple GPUs and audio devices.

Prerequisites

Unraid server (or Linux system with Docker)
NVIDIA GPU(s)
Docker and Docker Compose
VFIO-PCI support in kernel
NVIDIA drivers installed on host

Quick Start

Clone the repository:

git clone https://github.com/yourusername/docker-windows-gpu.git
cd docker-windows-gpu

Configure your environment:

# Add this to your environment; set it to your GPU ID(s), PCI address(es), or 'none'
NVIDIA_VISIBLE_DEVICES=0

Start the container:

docker-compose up -d

Configuration

Environment Variables

NVIDIA_VISIBLE_DEVICES: Specify GPU(s) to use
- Single GPU: NVIDIA_VISIBLE_DEVICES=0
- Multiple GPUs: NVIDIA_VISIBLE_DEVICES=0,1
- PCI addresses: NVIDIA_VISIBLE_DEVICES=0000:03:00.0,0000:04:00.0
- No GPU: NVIDIA_VISIBLE_DEVICES=none
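The value forms listed above could be told apart roughly as follows. This is a minimal sketch and not the code from gpu-switch.sh: the helper names classify_gpu_id, split_gpu_ids and describe_gpu_ids are hypothetical, the PCI check is deliberately loose (anything containing a colon), and the GPU-... UUID form is only assumed to be accepted because the review suggestion in this thread handles it.

```shell
# classify_gpu_id ENTRY (hypothetical helper): print the kind of one entry.
classify_gpu_id() {
    case "$1" in
        ''|none)  echo "none" ;;     # empty or explicit 'none': no GPU requested
        GPU-*)    echo "uuid" ;;     # nvidia-smi style UUID (assumed to be accepted)
        *:*)      echo "pci" ;;      # loose check: treat anything with ':' as a PCI address
        *[!0-9]*) echo "invalid" ;;  # anything else containing a non-digit
        *)        echo "index" ;;    # digits only: GPU index
    esac
}

# split_gpu_ids LIST (hypothetical helper): print one comma-separated entry per line.
split_gpu_ids() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# describe_gpu_ids LIST (hypothetical helper): print "entry -> kind" for every entry.
describe_gpu_ids() {
    split_gpu_ids "$1" | while IFS= read -r id; do
        printf '%s -> %s\n' "$id" "$(classify_gpu_id "$id")"
    done
}
```

For example, describe_gpu_ids "$NVIDIA_VISIBLE_DEVICES" prints one line per configured device, which makes a mistyped entry easy to spot before any driver is touched.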

Docker Compose

The provided docker-compose.yml includes all necessary configurations for:

GPU passthrough
RDP access
KVM support
Network management
Persistent storage

Usage

Manual GPU Management (until I find a way to run pre-start and post-stop hooks, use it with User Scripts)

Bind GPU to container:

NVIDIA_VISIBLE_DEVICES=0 /boot/config/plugins/user.scripts/gpu-switch.sh start windows

Release GPU:

NVIDIA_VISIBLE_DEVICES=0 /boot/config/plugins/user.scripts/gpu-switch.sh stop windows

Script Details

The gpu-switch.sh script handles:

GPU detection and validation
Driver management (NVIDIA ⟷ VFIO-PCI)
Audio device pairing
Docker container configuration
Error handling and logging
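For readers unfamiliar with the driver management step (NVIDIA ⟷ VFIO-PCI), the usual Linux mechanism is the PCI sysfs interface: unbind the device from its current driver, set driver_override, and ask the kernel to re-probe. The sketch below illustrates that mechanism only; it is not taken from gpu-switch.sh, the function name rebind_pci_device and the SYSFS_PCI variable are hypothetical, and on a real host it needs root and a device that is not in use.

```shell
# SYSFS_PCI is overridable so the function can be dry-run against a scratch directory.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# rebind_pci_device ADDR DRIVER (hypothetical helper)
# ADDR is a full address such as 0000:01:00.0; DRIVER is e.g. vfio-pci or nvidia.
rebind_pci_device() {
    dev="$SYSFS_PCI/devices/$1"
    [ -d "$dev" ] || { echo "no such PCI device: $1" >&2; return 1; }

    # Detach the device from whatever driver currently owns it, if any.
    if [ -e "$dev/driver/unbind" ]; then
        printf '%s' "$1" > "$dev/driver/unbind" || return 1
    fi

    # Name the driver that should claim the device, then ask the kernel to re-probe it.
    printf '%s' "$2" > "$dev/driver_override" || return 1
    printf '%s' "$1" > "$SYSFS_PCI/drivers_probe"
}
```

Handing a GPU to the container would then be rebind_pci_device 0000:01:00.0 vfio-pci, and returning it to the host rebind_pci_device 0000:01:00.0 nvidia; the audio function at .1 would need the same treatment.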

gpu switch version: 0.1

# Without GPU:
NVIDIA_VISIBLE_DEVICES="" ./gpu-switch.sh start container_name

# With single GPU:
NVIDIA_VISIBLE_DEVICES="0" ./gpu-switch.sh start container_name

# With multiple GPUs:
NVIDIA_VISIBLE_DEVICES="0,1" ./gpu-switch.sh start container_name

# With PCI addresses:
NVIDIA_VISIBLE_DEVICES="0000:03:00.0,0000:04:00.0" ./gpu-switch.sh start container_name

# Explicitly disable GPU:
NVIDIA_VISIBLE_DEVICES="none" ./gpu-switch.sh start container_name

Steel-skull · 2024-10-30T13:52:15Z

I have to modify the Docker Compose side: I was under the impression it supported pre-start and post-stop scripts, but I misread and it's post-start and pre-stop. I'll need to find a new way to make this work. The script itself still works and can be run through User Scripts in Unraid.

[Again, though, I'm waiting on a GPU, so I haven't been able to fully test it.]
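One way to get pre-start and post-stop behaviour without Compose hooks is to wrap the container lifecycle in a small script that a User Scripts entry could call. The following is only a sketch under assumptions: run_with_gpu and the GPU_SWITCH and DOCKER variables are hypothetical names, it assumes gpu-switch.sh start may run before the container is started, and it expects NVIDIA_VISIBLE_DEVICES to be exported beforehand.

```shell
# Both commands are overridable; the default script path is the one used in this thread.
GPU_SWITCH="${GPU_SWITCH:-/boot/config/plugins/user.scripts/gpu-switch.sh}"
DOCKER="${DOCKER:-docker}"

# run_with_gpu CONTAINER (hypothetical wrapper)
run_with_gpu() {
    # "Pre-start": hand the GPU over before the container is started.
    "$GPU_SWITCH" start "$1" || return 1

    # Start the container, then block until it exits.
    "$DOCKER" start "$1" >/dev/null && "$DOCKER" wait "$1" >/dev/null
    status=$?

    # "Post-stop": always give the GPU back, even if the start failed.
    "$GPU_SWITCH" stop "$1" || return 1
    return "$status"
}
```

Because docker wait blocks until the container stops, the release step runs only after Windows has shut down, which is the ordering the Compose hooks cannot provide.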

kroese · 2024-11-09T23:16:29Z

Very interesting work!! Did you already receive your GPU to test it?

JosueIsrael-prog · 2024-11-11T10:21:16Z

Very good

maksymdor · 2024-11-11T10:45:50Z

Hmm! Interesting

vinkay215 · 2024-11-11T17:47:35Z

gpu-switch.sh

+if ! check_gpu_needed; then
+    log "Continuing without GPU management"
+    exit 0
+fi


Instead of listing all containers, you can directly check the existence of the container using docker container inspect, which is more efficient since it only checks the specified container without scanning the entire list. Here’s how to replace that line:

if ! docker container inspect "$CONTAINER_NAME" > /dev/null 2>&1; then
    error_exit "Container $CONTAINER_NAME does not exist"
fi

The docker container inspect command returns an error if the container does not exist, so you can use it to directly verify the container’s existence without listing all containers.

I'll take a look at implementing this. Thanks for the ideas.

vinkay215 · 2024-11-11T17:51:44Z

gpu-switch.sh

+}
+
+# Convert any GPU identifier to PCI address
+convert_to_pci_address() {


Incorporating these improvements, here’s the final optimized convert_to_pci_address function:

convert_to_pci_address() {
    local device="$1"
    local gpu_address=""

    if [[ "$device" =~ ^[0-9]+$ || "$device" =~ ^GPU-.*$ ]]; then
        # Convert GPU index or UUID to PCI address
        gpu_address=$(nvidia-smi --id="$device" --query-gpu=gpu_bus_id --format=csv,noheader 2>/dev/null | tr -d '[:space:]')
    else
        # Direct PCI address provided
        gpu_address="$device"
    fi

    # Check for valid output
    if [ -z "$gpu_address" ]; then
        error_exit "Failed to get PCI address for device: $device"
    fi

    # Standardize format
    echo "$gpu_address" | sed -e 's/0000://' -e 's/\./:/g'
}

I'll take a look at implementing this as well. Thanks for the ideas.

Merged with main.

tl123987 · 2024-11-12T02:53:28Z

share failed? is there something wrong?

Steel-skull · 2024-11-14T03:59:50Z

Very interesting work!! Did you already receive your GPU to test it?

Sadly, no. The one I ordered from eBay was extremely unstable (it kept crashing my server when used with Ollama), so I'm waiting for my money back.

Steel-skull · 2024-11-14T04:00:21Z

share failed? is there something wrong?

You'll have to expand on this; I don't understand.

Update gpu-switch.sh

tl123987 · 2024-11-14T06:35:10Z

Looking forward to the finished version, thank you. I hope there will be a complete tutorial in the future.

ilolm · 2024-11-15T21:20:52Z

gpu-switch.sh

It's really handy man)

hzxie

Syntax error in gpu-switch.sh

hzxie · 2024-11-17T05:46:26Z

gpu-switch.sh

+        local gpu_address=$(convert_to_pci_address "$device")
+        if [ -z "$gpu_address" ]; then
+            error_exit "Failed to get PCI address for device: $device"
+        }


./gpu-switch.sh: line 64: syntax error near unexpected token `}'
./gpu-switch.sh: line 64: ` }'

hzxie · 2024-11-17T05:47:17Z

gpu-switch.sh

+        if [ -z "$gpu_audio_address" ]; then
+            log "Warning: No audio device found for GPU $gpu_address"
+            continue
+        }


./gpu-switch.sh: line 75: syntax error near unexpected token `}'
./gpu-switch.sh: line 75: ` }'
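Both reports point at the same cause: in shell, an if block is closed with fi, not with }. A minimal sketch of the two checks with the terminator corrected is shown here. The log and error_exit stubs only stand in for the script's real helpers, and the check_addresses wrapper is a hypothetical name used to make the fragment self-contained.

```shell
# Stubs standing in for the script's own helpers (assumed behaviour).
log() { echo "GPU-SWITCH: $*"; }
error_exit() { log "ERROR: $*"; exit 1; }

# check_addresses GPU_ADDRESS AUDIO_ADDRESS (hypothetical wrapper for the two checks)
check_addresses() {
    if [ -z "$1" ]; then
        error_exit "Failed to get PCI address for device"
    fi    # was: }

    if [ -z "$2" ]; then
        log "Warning: No audio device found for GPU $1"
        return 0    # the original uses 'continue' inside its loop
    fi    # was: }

    echo "gpu=$1 audio=$2"
}
```

Running bash -n gpu-switch.sh before committing would catch this class of error without executing the script.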

hzxie · 2024-11-17T05:51:18Z

I got the following error after running gpu-switch.sh.

[root@SLab-Mocap-Server 11 ]$ NVIDIA_VISIBLE_DEVICES=0 ./gpu-switch.sh start windows-11
GPU-SWITCH [2024-11-17 13:49:30]: Warning: No audio device found for GPU 000001:00:0
GPU-SWITCH [2024-11-17 13:49:30]: ERROR: No valid GPU devices found

Here's the output of lspci -v | grep -A 15 " NVIDIA "

01:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: bus master, fast devsel, latency 0, IRQ 133
        Memory at a4000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 90000000 (64-bit, prefetchable) [size=256M]
        Memory at a0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at 5000 [size=128]
        Expansion ROM at a5000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, IntMsgNum 0
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
--
01:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: bus master, fast devsel, latency 0, IRQ 17
        Memory at a5080000 (32-bit, non-prefetchable) [size=16K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, IntMsgNum 0
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel

01:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1) (prog-if 30 [XHCI])
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: fast devsel, IRQ 128
        Memory at a2000000 (64-bit, prefetchable) [size=256K]
        Memory at a2040000 (64-bit, prefetchable) [size=64K]
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, IntMsgNum 0
        Capabilities: [b4] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci

01:00.3 Serial bus controller: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. Device 2503
        Flags: bus master, fast devsel, latency 0, IRQ 130
        Memory at a5084000 (32-bit, non-prefetchable) [size=4K]
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Endpoint, IntMsgNum 0
        Capabilities: [b4] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: nvidia-gpu
        Kernel modules: i2c_nvidia_gpu

02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 01)
        Subsystem: ASRock Incorporation Device 8125
        Flags: bus master, fast devsel, latency 0, IRQ 17
        I/O ports at 4000 [size=256]
        Memory at a5320000 (64-bit, non-prefetchable) [size=64K]
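A possible reading of the "GPU 000001:00:0" address in the log above: nvidia-smi typically reports the bus ID with an eight-digit domain (for example 00000000:01:00.0), and the sed expressions 's/0000://' and 's/\./:/g' from the review suggestion would strip only the first 0000: match and turn the function separator into a colon, leaving 000001:00:0. That string matches nothing in the lspci output, so the 01:00.1 audio function is never found. The sketch below shows one way to normalize to the 0000:01:00.0 form used under /sys/bus/pci and to derive the sibling audio function. The names normalize_pci_address and audio_function_of are hypothetical, the upper four domain digits are assumed to be zero, and none of this has been run against the PR's script.

```shell
# normalize_pci_address ADDR (hypothetical helper): print ADDR as dddd:bb:dd.f in lowercase.
normalize_pci_address() {
    # sysfs and lspci use lowercase hex digits.
    addr=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$addr" in
        ????????:??:??.?) echo "${addr#????}" ;;   # 8-digit domain: drop the upper four digits (assumed zero)
        ????:??:??.?)     echo "$addr" ;;          # already in sysfs form
        ??:??.?)          echo "0000:$addr" ;;     # short lspci form, domain 0 assumed
        *)                echo "invalid PCI address: $1" >&2; return 1 ;;
    esac
}

# audio_function_of ADDR (hypothetical helper): print function .1 of the same slot,
# which is where the GPU's audio controller sits in the lspci output above.
audio_function_of() {
    echo "${1%.*}.1"
}
```

With the reporter's card, normalize_pci_address 00000000:01:00.0 gives 0000:01:00.0 and audio_function_of 0000:01:00.0 gives 0000:01:00.1, which corresponds to the 01:00.1 audio device listed by lspci.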

Steel-skull added 2 commits October 30, 2024 08:19

Update compose.yml

52758f4

Steel-skull mentioned this pull request Oct 30, 2024

GPU Passthrough #22

Open

Karinza38 approved these changes Nov 11, 2024


vinkay215 reviewed Nov 11, 2024


Steel-skull added 2 commits November 13, 2024 22:10

Update gpu-switch.sh

a77b400

Merge pull request #1 from Steel-skull/Steel-skull-patch-2

e8fa68e

Update gpu-switch.sh

Santhoshsanna1003 approved these changes Nov 14, 2024


ilolm approved these changes Nov 15, 2024


hzxie suggested changes Nov 17, 2024
