
DirectML Allocates Excessive Memory Exceeding the Capacity of Radeon RX 6700 XT GPU #673

Open
Hedredo opened this issue Dec 2, 2024 · 0 comments


Hedredo commented Dec 2, 2024

Hi everyone,

I'm experiencing an issue with DirectML. Here's the setup I'm using:

  • OS: Windows 11
  • IDE: VSCode 1.95
  • Python: 3.10
  • TensorFlow: tensorflow-cpu 2.10 + tensorflow-directml-plugin 0.4.0
  • GPU: AMD Radeon RX 6700 XT (Adrenalin 24.10.1, VRAM: 12,288 MB)

The issue is that DirectML doesn't seem to manage GPU memory correctly: it advertises far more memory than the card actually has, and training eventually crashes the kernel.

When I start a new task with TensorFlow, DirectML creates a device with 25,405 MB of memory, more than double my 12 GB of dedicated VRAM, as shown in the logs:

2024-12-02 07:30:12.449250: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-12-02 07:30:12.451437: I tensorflow/c/logging.cc:34] DirectML: creating device on adapter 0 (AMD Radeon RX 6700 XT)
Dropped Escape call with ulEscapeCode : 0x03007703
2024-12-02 07:30:12.893424: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2024-12-02 07:30:12.893467: W tensorflow/core/common_runtime/pluggable_device/pluggable_device_bfc_allocator.cc:37] Ignoring the value of TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested by the device.
2024-12-02 07:30:12.893490: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 25405 MB memory) -> physical PluggableDevice (device: 0, name: DML, pci bus id: <undefined>)
2024-12-02 07:30:13.111053: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
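
For what it's worth, I don't think my training code matters here: assuming the environment listed above, just initializing the GPU device seems to be enough to reproduce this device-creation log, e.g.:

```python
import tensorflow as tf

# Listing logical devices (or running any op on GPU:0) initializes the
# DirectML pluggable device and prints the "Created TensorFlow device ...
# with 25405 MB memory" line shown above.
print(tf.config.list_logical_devices("GPU"))

# A trivial op placed on the device, just to confirm it is actually used.
with tf.device("/GPU:0"):
    print(tf.reduce_sum(tf.random.normal((1024, 1024))))
```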

I monitored dedicated and shared memory usage during training. Once dedicated memory (12 GB) is fully used and shared memory reaches about 6 GB, DirectML tries to allocate even more (presumably up to the advertised 25 GB), and the kernel crashes every time at that point.

Training works fine as long as it doesn't need more VRAM than is actually available. To avoid crashes, I limit the size of the training data and the number of epochs, but this is a workaround, not a solution.
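
To give an idea of the workaround (the dataset and model below are hypothetical stand-ins, not my real training code), I essentially do something like this:

```python
import tensorflow as tf

# Hypothetical stand-ins for my real data and model, just to show the shape
# of the workaround: fewer samples, a modest batch size, and few epochs.
features = tf.random.normal((4096, 32))
labels = tf.random.uniform((4096,), maxval=2, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# Cap the amount of data and the number of epochs so the run stays within
# the 12 GB of dedicated VRAM instead of spilling into shared memory.
small_dataset = dataset.take(1024).batch(32).prefetch(tf.data.AUTOTUNE)
model.fit(small_dataset, epochs=3)
```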

I've tried every method I could find to manually manage GPU memory in TensorFlow, but the DirectML device is always created with 25,405 MB of memory.
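
For reference, this is a minimal sketch of the kind of calls I have tried. These are the standard TensorFlow memory-management APIs, and the 10,240 MB cap is just an arbitrary value below my 12 GB of dedicated VRAM:

```python
import os

# Attempt 0: environment variable, set before importing TensorFlow.
# The log above shows it is ignored ("Ignoring the value of
# TF_FORCE_GPU_ALLOW_GROWTH because force_memory_growth was requested").
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")  # DirectML shows up as GPU:0
print(gpus)

if gpus:
    # Attempt 1: allocate memory on demand instead of up front.
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except (RuntimeError, ValueError) as err:
        print("set_memory_growth:", err)

    # Attempt 2: cap the logical device below the 12 GB of dedicated VRAM.
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=10240)],
        )
    except (RuntimeError, ValueError) as err:
        print("set_logical_device_configuration:", err)
```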

I also attempted to use ROCm on Ubuntu, but the latest version isn’t compatible with my 6700 XT, even with some known "tricks." I might try an older version of ROCm at some point.

Has anyone else encountered this problem with the RX 6700 XT? Any suggestions for resolving this would be greatly appreciated.

Thanks in advance!
