
Turing Tensor Core operations must be run on a machine with compute capability at least 75 #13

Closed
MultiPath opened this issue Jan 15, 2022 · 15 comments

Comments

@MultiPath

Hi

Thanks for sharing this great work. Although the code compiles, it fails when running the testbed.
Does this mean there is no way to run this code on older GPUs, e.g. the Tesla V100, which appears to be architecture 70?
Are there any alternatives?

Thanks

@Tom94
Collaborator

Tom94 commented Jan 15, 2022

Hi there! While we don't officially support Volta GPUs, you might find success replacing FullyFusedMLP (which relies on Turing+ TensorCores) with the more general CutlassMLP in the network config.

(If no config is specified on the command line, the testbed uses configs/<mode>/base.json.)
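For illustration, one way to apply that swap is to copy the default config, change the network backend, and point the testbed at the new file with --network. The sketch below uses nlohmann::json; the "network"/"otype" keys mirror the bundled base.json files, but treat the exact layout as an assumption and check your own config.

// Hedged sketch (not part of the repo): write a config variant that selects
// CutlassMLP, which does not require Turing-class tensor cores.
#include <fstream>
#include <nlohmann/json.hpp>

int main() {
	std::ifstream in{"configs/image/base.json"};
	nlohmann::json config = nlohmann::json::parse(in);

	// Swap the fully fused MLP for the more general CUTLASS-based MLP.
	config["network"]["otype"] = "CutlassMLP";

	std::ofstream out{"configs/image/base_cutlass.json"};
	out << config.dump(2) << std::endl;
	return 0;
}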

@MultiPath
Author

It can run with CutlassMLP after commenting out the error raised in the testbed.
However, I found that the loss was completely frozen. I also tried the tiny-cuda-nn repo and ran the same test; the results look similar.

[screenshot: training log with frozen loss]
In the screenshot above, for example, the loss stays at 0.206 and never decreases.
I ran python scripts/run.py --scene data/image/ --mode image --network base.json

@Tom94
Collaborator

Tom94 commented Jan 16, 2022

That's unfortunate. Apologies, but it's currently not in our scope to add dedicated implementations for older GPUs.

In part, this is because older GPUs would benefit much less from a fully fused implementation: on the small-MLP workload they are more compute-bound than memory-bound, compared with newer GPUs.

All that said, I would still be more than happy to merge code contributions that improve compatibility.

@Tom94 Tom94 closed this as completed Jan 16, 2022
@MultiPath
Author

MultiPath commented Jan 16, 2022

Well, the V100 also has 32 GB of memory, which is not too bad, and it should also support Tensor Cores, just with compute capability 70. I am not sure which part of the code restricts usage to arch >= 75.

> All that said, I would still be more than happy to merge code contributions that improve compatibility.

I have no idea what caused the issue. I'm not sure where to start; it would be nice to learn more from you, @Tom94.

@Tom94
Collaborator

Tom94 commented Jan 16, 2022

tiny-cuda-nn's CutlassMLP supported V100 tensor core ops once in the past (hence my suggestion above), and I am not 100% sure where it might have broken in the interim. I stopped explicitly supporting it after all of the (at the time, few) colleagues who depended on the framework moved on to newer GPUs.

The top of dependencies/tiny-cuda-nn/include/tiny-cuda-nn/cutlass_matmul.h contains a number of template configurations that are conditional on compute capability. Reason being that the tensor cores on different GPU generations behave subtly differently and support differently sized operands.

A shot in the dark, which would be amazing if it works: does the codebase compile when you replace

#ifdef TCNN_AMPERE
using SmArch = typename std::conditional<std::is_same<network_precision_t, float>::value, cutlass::arch::Sm75, cutlass::arch::Sm80>::type;
#else
using SmArch = cutlass::arch::Sm75;
#endif

with

using SmArch = cutlass::arch::Sm70;

? Perhaps (though unlikely) my old maze of compile-time conditionals survived the last year intact enough to still support Sm70.

Note: you likely also have to comment out lines 78/79 in dependencies/tiny-cuda-nn/CMakeLists.txt

# Remove unsupported architectures
list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES 53 60 61 70 "86+PTX")

as well as set the environment variable TCNN_CUDA_ARCHITECTURES=70 if your arch isn't detected right automatically. Fingers crossed!

If it turns out to be this simple, I'm of course happy to automate all of these manual steps upstream.
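For reference, if the Sm70 path does still work, a hedged sketch of how it might be folded back into that conditional (rather than hand-edited) is below. Note that TCNN_VOLTA is a hypothetical macro used purely for illustration here, not something that exists in the codebase:

#if defined(TCNN_VOLTA)
// Hypothetical Volta path (compute capability 70): select Sm70 tensor core configurations.
using SmArch = cutlass::arch::Sm70;
#elif defined(TCNN_AMPERE)
// Ampere and newer: Sm80, except when the network runs in full precision.
using SmArch = typename std::conditional<std::is_same<network_precision_t, float>::value, cutlass::arch::Sm75, cutlass::arch::Sm80>::type;
#else
// Turing (compute capability 75): current default.
using SmArch = cutlass::arch::Sm75;
#endif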

@MultiPath
Author

Thanks for the very detailed reply!

@MultiPath
Author

MultiPath commented Jan 17, 2022

@Tom94

Confirmed that the above change (using SmArch = cutlass::arch::Sm70;) worked, and commenting out list(REMOVE_ITEM CMAKE_CUDA_ARCHITECTURES 53 60 61 70 "86+PTX") seems not to be necessary.

I tested both CutlassMLP and FullyFusedMLP on the NeRF fox scene (and also on the image fitting task); on my server they run at 96.55 steps/s and 110.26 steps/s, respectively.
Both losses decreased and produced reasonable output.
However, FullyFusedMLP ends up with a roughly 2x larger loss than CutlassMLP, and its output image is also somewhat worse.

I also tested the tiny-cuda-nn repo on its own. The 2D image example (https://github.com/NVlabs/tiny-cuda-nn#example-learning-a-2d-image) worked with CutlassMLP; however, the loss diverged with FullyFusedMLP.

So something is still different between CutlassMLP and FullyFusedMLP, and the latter is still not working correctly.

@1kaiser

1kaiser commented Jan 17, 2022

Will it work on a GTX 1650 Max-Q (Turing architecture)?

@Tom94
Collaborator

Tom94 commented Jan 17, 2022

@MultiPath wow, that's great! Especially the fact that FullyFusedMLP works out of the box, at least for instant-ngp. I'm also positively surprised by the overall speed: on my 3090 I get 169.03 steps/s, which is not even 2x faster. Very glad that this works!

I'll head over to tiny-cuda-nn and automate this upon CMake detecting a 70-arch.

Regarding the failure in the 2D learning example: may I ask you to test one more thing? Could you go into dependencies/tiny-cuda-nn/src/fully_fused_mlp.cu and replace the two occurrences of

	const int N_ITERS = WIDTH >= 256 ? 2 : 8;
	const uint32_t BLOCK_DIM_Z = (INFERENCE && WIDTH == 128) ? 2 : 1;

with

	const int N_ITERS = 1;
	const uint32_t BLOCK_DIM_Z = 1;

(lines 284 and 586)? On the off chance that this helps, how far (in powers of two) can you increase N_ITERS until things stop working? Much appreciated!

@MultiPath
Author

Hi, I tried this and changed both places.
Unfortunately, it does not solve the issue I ran into with the 2D image learning example. The following is a screenshot:
[screenshot: FullyFusedMLP training log]
The loss did not go down and diverged after 5k steps.

On the other hand, using CutlassMLP is fine:
[screenshot: CutlassMLP training log]

Also, it would be nice to know the ETA for the PyTorch bindings for tiny-cuda-nn that you mentioned; even just CutlassMLP forward/backward alone would be great!
I think it would be super helpful for many existing NeRF-based systems. I tried to take a read through the code this weekend; unfortunately, code at this level is quite outside my experience...

@Tom94
Collaborator

Tom94 commented Jan 17, 2022

Thanks for checking! I'll configure FullyFusedMLP to automatically fall back to CutlassMLP for arch 70 and lower in that case. This is unfortunately ~20% slower according to your numbers from above, but imho much better than sporadic failure. At least until we have a better grasp on the precise conditions of how and why FullyFusedMLP fails.
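For illustration, the kind of fallback I have in mind could look roughly like the sketch below (an assumption about the eventual fix, not the actual patch; select_mlp_otype is a hypothetical helper):

#include <cuda_runtime.h>
#include <string>

// Hedged sketch of a compute-capability-based fallback.
std::string select_mlp_otype(const std::string& requested) {
	cudaDeviceProp props;
	cudaGetDeviceProperties(&props, /*device=*/0);
	const int compute_capability = props.major * 10 + props.minor;

	// FullyFusedMLP needs Turing-class tensor cores (compute capability >= 75);
	// on anything older, fall back to the CUTLASS-based MLP.
	if (requested == "FullyFusedMLP" && compute_capability < 75) {
		return "CutlassMLP";
	}
	return requested;
}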

Regarding the PyTorch ETA: apologies for not being more forthcoming, but all I can say is "it's done when it's done". Could be quick -- or it could be a while -- hard for me to predict.

@MultiPath
Author

Thanks for the response!

@Tom94
Collaborator

Tom94 commented Jan 17, 2022

Thanks again for the help with troubleshooting. I pushed a fix which should add compatibility with V100 GPUs as well as earlier ones (through regular full-precision matmuls powered by CUTLASS): #22

Even though CI successfully builds these and I can run them locally, I don't have GTX 1000-series, K80, or V100 GPUs available to test with. I'd appreciate a confirm/deny whether instant-ngp actually runs on any of these now.

@Tom94 Tom94 reopened this Jan 17, 2022
@myagues
Contributor

myagues commented Jan 17, 2022

Can confirm it runs, see #33.

@Tom94 Tom94 closed this as completed Jan 17, 2022
@endlesswho

> Thanks again for the help with troubleshooting. I've got the following PR, which should add compatibility to V100 GPUs as well as earlier ones (through regular full-precision matmuls powered by CUTLASS): #22
>
> Even though CI successfully builds these and I can run them locally, I don't have GTX 1000-series, K80, or V100 GPUs available to test with. I'd appreciate a confirm/deny whether instant-ngp actually runs on any of these now.

The newest version runs successfully on my V100 GPU.
