Support for Tensorboard #65

FerdinandKlingenberg · 2021-09-03T08:03:00Z

Dear Rémi,

First of all, thank you for this great module!

I notice very low GPU utilization during my training, and I would like to monitor it using TensorBoard to find the bottleneck. I see issue #26 is closed, but I hope it can be done like you did for SR4RS.

Thanks,
Ferdinand Klingenberg

vidlb · 2021-09-03T08:20:13Z

Hi @remicres, maybe this repo could be useful: https://github.com/RustingSword/tensorboard_logger

remicres · 2021-09-06T15:37:22Z

Hi @FerdinandKlingenberg ,

TensorBoard support (which is not developed yet) would let the user follow the computed metrics (accuracy, precision, f-score, loss value, ...) during training. I don't know if it would be useful for monitoring GPU/CPU usage.
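
For illustration, following metrics in TensorBoard from Python could look like the sketch below. It uses the TensorFlow 2 `tf.summary` API and is not how OTBTF or TensorflowModelTrain do it (that support is not developed yet). The function name `log_metrics`, the metric names, and the values are hypothetical. The import is guarded so the snippet still runs where TensorFlow is not installed.

```python
import tempfile


def log_metrics(logdir, history):
    """Write one scalar per metric and per step for TensorBoard.

    history is a list of dicts (metric name -> value), one dict per step.
    Returns False if TensorFlow is not installed, True otherwise.
    """
    try:
        import tensorflow as tf  # TensorFlow 2.x assumed
    except ImportError:
        return False

    writer = tf.summary.create_file_writer(logdir)
    with writer.as_default():
        for step, metrics in enumerate(history):
            for name, value in metrics.items():
                tf.summary.scalar(name, value, step=step)
    writer.flush()
    return True


# Made-up values, for illustration only.
history = [
    {"loss": 0.9, "accuracy": 0.55},
    {"loss": 0.6, "accuracy": 0.71},
]
logdir = tempfile.mkdtemp(prefix="tb_logs_")
written = log_metrics(logdir, history)
```

The resulting curves can then be viewed by pointing `tensorboard --logdir` at that directory.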

Could you provide some additional information about your case? Which model and batch size do you use? Which GPU, OTBTF version, and nvidia-docker runtime?
Have you tried disabling streaming during training (training.usestreaming off and validation.usestreaming off in TensorflowModelTrain)? With streaming, the patches are read on the fly (so they take up no memory), but the I/O operations slow down the computation.
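
As a sketch, the two options above could be switched off when launching the application from Python. Only `training.usestreaming` and `validation.usestreaming` come from this thread. The helper `build_train_command`, the executable name `otbcli_TensorflowModelTrain` (following the usual OTB `otbcli_<Application>` convention), and passing the remaining parameters through `other_args` are assumptions.

```python
import shlex


def build_train_command(other_args=(), streaming=False):
    """Build the argument list for TensorflowModelTrain.

    other_args holds whatever parameters your run already uses
    (model, sources, ...); they are passed through unchanged.
    """
    flag = "on" if streaming else "off"
    return [
        "otbcli_TensorflowModelTrain",
        *other_args,
        "-training.usestreaming", flag,
        "-validation.usestreaming", flag,
    ]


cmd = build_train_command()
print(shlex.join(cmd))
# To launch it: subprocess.run(cmd, check=True)
```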

While TensorflowModelTrain is nice for educational purposes and for training small models quickly, you will definitely need to move to Python to train your models in your own way and have more control over your process. The upcoming OTBTF-3.0 release will bring more pythonic stuff that comes with TensorFlow 2. There are a lot of cool things in TF2, like the new TensorFlow profiler 😉

@vidlb this looks really useful for implementing the feature in TensorflowModelTrain!

FerdinandKlingenberg · 2021-09-08T08:00:06Z

Hi @remicres,

Thanks for your reply!

You're right; TensorBoard doesn't show much helpful GPU info out of the box. I came across it when I googled how to improve GPU utilization: https://www.tensorflow.org/guide/gpu_performance_analysis. I thought TensorBoard could be helpful. And yes, as you mention, that relies on the TensorBoard Profiler plugin. It does not work with the current OTBTF/SR4RS because of the different TF versions, right?

I use the U-Net model with patch size 64 on some cloud-free mosaicked Sentinel-2 data, on a server with a 16 CPU/250GB RAM setup. The GPU is a Tesla V100S-32GB, but the setup is relatively new for me, as it uses the VMware Bitfusion technology to share the GPU between users. Because of Bitfusion, I cannot use the NVIDIA Container Toolkit, which complicates things a lot (e.g., nvidia-smi does not work, only bitfusion smi does). I managed to run the Docker build from Vincent: BASE_IMG=nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04 with the latest Dockerfile, modified for use with Bitfusion. Big thanks to @vidlb for your Docker setup. It helped a lot in adapting to Bitfusion!
I have now tried with *.usestreaming off, which helped a tiny bit, but not much. The GPU utilization hovers around 10%.

I would also like to add os.environ["TF_GPU_THREAD_MODE"] = "gpu_private", which might improve things, I don't know. Could this line be added to tricks.py, or should it go in some other file?
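
A minimal sketch of that line in context, assuming the variable has to be in the environment before TensorFlow initializes the GPU, so it is set before the `import tensorflow` statement. Which OTBTF file is the right place is not settled here, and the helper name `set_gpu_thread_mode` is hypothetical.

```python
import os


def set_gpu_thread_mode(mode="gpu_private"):
    """Set TF_GPU_THREAD_MODE; call this before importing TensorFlow."""
    os.environ["TF_GPU_THREAD_MODE"] = mode
    return os.environ["TF_GPU_THREAD_MODE"]


set_gpu_thread_mode()

# Import TensorFlow only after the environment is configured:
# import tensorflow as tf
```

When the training runs through the OTB command line rather than a Python script, exporting the variable in the shell that launches the application (`export TF_GPU_THREAD_MODE=gpu_private`) should have the same effect.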

I still haven't dived into TF in Python, but I will likely do so later. For now, I look forward to OTBTF-3.0!

Thanks a lot,
Ferdinand Klingenberg

vidlb · 2021-09-08T08:42:26Z

Hi @FerdinandKlingenberg, maybe you should try with TF 2.4.3 and CUDA 11.0.3 (with TF_CUDA_COMPUTE_CAPABILITIES=7.0)?
Sometimes the latest version is buggy... And you only need CUDA 11.2 / TF 2.5 if you are using an RTX3000 or anything with COMPUTE_CAPABILITY=8.6.

But it's also possible that the low usage is due to your shared GPU setup, in which case it would require some additional configuration on the server side...

Good luck !

FerdinandKlingenberg · 2022-01-13T20:15:35Z

Hi @vidlb, a somewhat late follow-up. I tried your suggestion earlier, unfortunately with no change. After some more testing with other models, I think the low utilization is instead related to the patch size/grid size I am using.

@remicres, my part of the problem is solved. I will let you decide if you want to close this issue, given the headline is still valid.

Thank you both a lot,
Ferdinand Klingenberg

remicres added the enhancement (New feature or request) label on Feb 10, 2022
