Hi everyone,

I'm experiencing an issue with DirectML on my AMD Radeon RX 6700 XT: it doesn't seem to allocate GPU memory properly, which leads to kernel crashes during TensorFlow training.
When I start a new task with TensorFlow, DirectML creates the device with 25,405 MB of memory, well above the 12 GB of dedicated VRAM on the card, as shown in the logs:
2024-12-02 07:30:12.449250: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-02 07:30:12.451437: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2024-12-02 07:30:12.893424: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-12-02 07:30:12.893467: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2024-12-02 07:30:12.893490: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-12-02 07:30:13.111053: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
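For context, even a minimal script along these lines triggers the 25,405 MB device creation for me. This is a simplified sketch with placeholder model and data shapes, not my actual training code:

import tensorflow as tf  # tensorflow-cpu + tensorflow-directml-plugin

# The DirectML adapter shows up as a regular PluggableDevice named GPU:0.
print(tf.config.list_physical_devices('GPU'))

# Any small Keras model is enough to make DirectML create the device
# with the full 25,405 MB reported in the logs above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))
model.fit(x, y, epochs=1, batch_size=64)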
I monitored dedicated and shared memory usage during training. Once my dedicated memory (12 GB) is fully used and shared memory reaches roughly 6 GB, DirectML attempts to allocate even more (presumably toward the 25 GB it reported at startup), and the kernel crashes every time at that point.
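I watched dedicated and shared memory in Task Manager. For completeness, TensorFlow also exposes a per-device counter, though I don't know whether the DirectML PluggableDevice implements it, so treat this as a guess:

import tensorflow as tf

# get_memory_info() works for regular GPU devices; whether the DirectML
# PluggableDevice supports it is an assumption, hence the try/except.
try:
    info = tf.config.experimental.get_memory_info('GPU:0')
    print('current:', info['current'], 'peak:', info['peak'])
except Exception as e:
    print('memory info not available for this device:', e)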
Training works fine as long as it stays within the available VRAM. To avoid the crashes, I limit the size of the training data and the number of epochs, but that is a workaround, not a solution.
I've tried every method I could find for manually managing GPU memory in TensorFlow, but DirectML always creates the device with 25,405 MB.
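Concretely, the attempts looked roughly like the sketch below (simplified, with example values; the 11264 MB cap is just an illustration). None of them change the allocation, and the warning in the log above even says memory growth is ignored because force_memory_growth was requested by the device:

import os
# attempt 1: the environment-variable route, set before importing TensorFlow;
# the log above shows it being ignored for the DirectML device
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # attempt 2: enable per-device memory growth
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # attempt 3 (tried instead of attempt 2, not together with it):
    # cap the logical device below the 12 GB of dedicated VRAM
    # tf.config.set_logical_device_configuration(
    #     gpus[0],
    #     [tf.config.LogicalDeviceConfiguration(memory_limit=11264)])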
I also attempted to use ROCm on Ubuntu, but the latest version isn’t compatible with my 6700 XT, even with some known "tricks." I might try an older version of ROCm at some point.
Has anyone else encountered this problem with the RX 6700 XT? Any suggestions for resolving this would be greatly appreciated.
Thanks in advance!