Bad fp16 performance on "GTX 1650" cards #1670
I went back to lc0 v0.21.4, which has a bunch of remarks related to GTX 16x0 bugfixes, and it still performs far worse in fp16 mode, which makes me suspect a driver regression. But I found something else that I find quite suspicious and alarming: the chip on this card is reported as a TU106 (the RTX 2060/2070 chip), not the TU117 a GTX 1650 is supposed to use. This makes me wonder if this is a cut-down card where they disabled the Tensor Cores via the driver or fused them off... but then forgot that the GTX 1650 and 1660 are supposed to have a bunch of additional FP16 units (to make up for the missing Tensor Cores) that the RTX cards didn't have. Something else pointing me in this direction is that the card draws a whopping 22 watts idling. A real GTX 1650 is supposed to draw something like 7 or 8 watts, at most.
The same card, using the lc0 release binaries and the Windows (rather than Linux) drivers, has about the same performance in fp16 and fp32 (the cuda/cuda-fp16 and cudnn/cudnn-fp16 backends are almost identical), slightly slower than a GTX 1060. The large difference confirms my suspicion that the driver is trying to "emulate" the right kind of performance because the chip has Tensor Cores instead of FP16 units. Note that this is about half of what a GTX 1650 is supposed to do in fp16 mode. I'll be returning this card, as I obviously feel pretty scammed by this "fake" GTX 1650.
The dx12 backend is also broken with this card (I was curious whether the driver would "forget" that it is supposed to have no Tensor Cores, but it's broken in a different way), just hanging during backend creation. Edit: Oops, it worked eventually and is now running at 1/20th of the expected speed.
The vendor confirmed there are two nearly identical-looking versions of this card. Obviously this part is only true for the latter card!
TU106 has Tensor Cores; they're just throttled when used in a TU11x configuration. Try hacking https://github.com/LeelaChessZero/lc0/blob/master/src/neural/cuda/network_cudnn.cc#L202 to see if anything helps; if not, file a bug with the hardware vendor.
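Before hacking the source, it may help to see what the driver actually reports for this board. Here's a minimal standalone sketch (my own, not lc0 code; it only assumes a working CUDA toolkit) that dumps the cudaDeviceProp fields the check around that line looks at:

```cpp
// check_gpu.cu -- build with: nvcc check_gpu.cu -o check_gpu
// Dumps the cudaDeviceProp fields that the cuDNN backend check inspects
// when deciding whether to enable tensor cores / NHWC layout.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
    std::printf("No CUDA devices found.\n");
    return 1;
  }
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, i);
    std::printf("GPU %d: %s\n", i, prop.name);
    std::printf("  compute capability: %d.%d\n", prop.major, prop.minor);
    std::printf("  pciDeviceID: 0x%x\n", prop.pciDeviceID);
    std::printf("  name contains \"GTX 16\": %s\n",
                std::strstr(prop.name, "GTX 16") ? "yes" : "no");
  }
  return 0;
}
```

If the name still says "GTX 1650" but the chip is really a TU106, that's exactly the case the name-based check mishandles.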
We have had a report of atrocious performance on a GTX 16x0 card with CUDA 11.5 and cuDNN 8.2.4, while CUDA 10.2 with cuDNN 7.4.2 (the versions we package) was working fine. Is it possible to try older versions?
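To be sure which versions a given setup is actually loading, a small sketch (again my own, assuming the cuDNN headers and library are installed locally) that prints the CUDA and cuDNN versions seen at run time:

```cpp
// versions.cu -- build with: nvcc versions.cu -o versions -lcudnn
#include <cstdio>
#include <cuda_runtime.h>
#include <cudnn.h>

int main() {
  int driver = 0, runtime = 0;
  cudaDriverGetVersion(&driver);    // e.g. 11050 for CUDA 11.5
  cudaRuntimeGetVersion(&runtime);
  std::printf("CUDA driver API: %d, runtime: %d\n", driver, runtime);
  std::printf("cuDNN: %zu (headers %d.%d.%d)\n",
              cudnnGetVersion(), CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL);
  return 0;
}
```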
I tested the release packages as indicated above (on Windows, where it's easier to swap out the DLLs); see #1670 (comment). I'm still seeing only about half the expected performance.
These are the default results:

Forcing fp32:

With the proposed change to force-enable Tensor Cores:
@shizukachan Do you mean NVIDIA or ASUS here? If it's the latter, does this mean that they, e.g., made some mistake in the BIOS and forgot to enable the Tensor Cores?
I tested this on Linux too now; indeed it's the same as on Windows: stepping back to cuDNN 7.6.5 "improves" the performance to slightly slower than fp32, which is still about half of what it should be. I wonder whether the earlier report of it "working fine" just meant "not atrociously bad" rather than "where it should be" :-)
I checked: that report was with a TU117, so probably "where it should be". When we did the test we were not troubleshooting a performance issue, just trying to see whether the new versions helped.
If you haven't returned this card yet, can you try whether the following enables the Tensor Cores?

```diff
--- a/src/neural/cuda/network_cudnn.cc
+++ b/src/neural/cuda/network_cudnn.cc
@@ -199,7 +199,8 @@ class CudnnNetwork : public Network {
     // Some GPUs (GTX 16xx) are SM 7.5 but don't have tensor cores
     // enabling TENSOR_OP_MATH or nhwc_ layout for them works but is
     // very very slow (likely because the system emulates it).
-    if (!strstr(deviceProp.name, "GTX 16")) {
+    if (!strstr(deviceProp.name, "GTX 16") ||
+        (deviceProp.pciDeviceID & ~0x7f) == 0x1f00) {
       hasTensorCores = true;
       nhwc_ = true;
     }
```
I already tried this trick (unconditionally) above, where it says "With the proposed change to force-enable Tensor Cores".
We may have a workaround/fix in #1675 if you can still try it.
I can't, and the card was a TU106 anyway (not a TU11x); it was only half the normal speed even on the old CUDA 10.2 / cuDNN 7.4 (release package on Windows), as described above.
lc0 v0.28.2 built from git
Network hanse-69722-vf2

If I force normal (not fp16) mode:
Notice the performance warning: in reality it's the other way around, by a factor of what, 10x? IIRC lc0 had GTX 16x0 series optimizations at some point; they must have broken.
The same behavior happens with the cuda/cuda-fp16 backends.