Support for Tensorboard #65

Open
FerdinandKlingenberg opened this issue Sep 3, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@FerdinandKlingenberg

Dear Rémi,

First of all, thank you for this great module!

I notice I have very low GPU utilization during my training, and I would like to monitor it using TensorBoard to find the bottleneck. I see issue #26 is closed, but I hope it can be done like you did for the SR4RS.

Thanks,
Ferdinand Klingenberg

@vidlb
Contributor

vidlb commented Sep 3, 2021

Hi @remicres, maybe this repo could be useful: https://github.com/RustingSword/tensorboard_logger

@remicres
Owner

remicres commented Sep 6, 2021

Hi @FerdinandKlingenberg ,

The Tensorboard support (which is not developed yet) would be something to enable the user to follow the computed metrics (accuracy, precision, f-score, loss value, ...) during the training. I don't know if it would be useful to monitor the GPU/CPU usage.

  • Could you provide some additional information about your case? Which model do you use, batch size? GPU, OTBTF version, nvidia-docker runtime?
  • Have you tried to disable the streaming during training (training.usestreaming off and validation.usestreaming off in TensorflowModelTrain)? When using streaming, the patches are read on the fly (no memory footprint used) but it slows down computing because of I/O operations.
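The two options above can be sketched as extra command-line arguments. This is only a minimal illustration: the helper name `with_streaming_disabled` is hypothetical, and the base command is a placeholder that leaves out the other parameters TensorflowModelTrain needs.

```python
import shlex


def with_streaming_disabled(base_args):
    """Return a copy of the argument list with both streaming options set to 'off'."""
    return list(base_args) + [
        "-training.usestreaming", "off",
        "-validation.usestreaming", "off",
    ]


# Placeholder base command: the remaining TensorflowModelTrain parameters
# (inputs, model directory, etc.) are intentionally omitted here.
base = ["otbcli_TensorflowModelTrain"]
cmd = with_streaming_disabled(base)

# Show the resulting command line without running it.
print(shlex.join(cmd))
```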

While TensorflowModelTrain is nice for educational purposes and for training small models quickly, you will definitely need to move to Python to train your models in your own way and to have more control over your process. In the upcoming OTBTF-3.0 release we will bring more pythonic stuff that comes with Tensorflow-2. There are a lot of cool things in TF2, like the new Tensorflow profiler 😉

@vidlb this looks really useful for implementing the feature in TensorflowModelTrain!

@FerdinandKlingenberg
Author

Hi @remicres,

Thanks for your reply!

You're right; TensorBoard doesn't show much helpful GPU info out of the box. It was something I came across when I googled how to improve GPU utilization: https://www.tensorflow.org/guide/gpu_performance_analysis. I was thinking that TensorBoard could be helpful. And yes, as you mention, that is with the TensorBoard Profiler plugin. This one does not work with the current OTBTF/SR4RS because of the different TF versions, right?

  • On some cloud-free mosaicked Sentinel-2 data, I use the U-Net model with patch size 64 on a server with a 16 CPU/250GB RAM setup. The GPU is a Tesla V100S-32GB, but the setup is relatively new for me, as it uses VMware Bitfusion to share the GPU between users. Because of Bitfusion, I cannot use the NVIDIA Container Toolkit, which complicates things a lot (e.g., nvidia-smi does not work; only bitfusion smi does). I managed to run the Docker build from Vincent (BASE_IMG=nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04) with the latest Dockerfile, modified for use with Bitfusion. Big thanks to @vidlb for your Docker setup. It helped a lot in adapting to Bitfusion!
  • I have now tried with *.usestreaming off, which helped a tiny bit, but not much. The GPU utilization is hovering around 10%.

I would also like to add os.environ["TF_GPU_THREAD_MODE"] = "gpu_private", which might improve things, I don't know. Could this line be added to tricks.py, or should it go in some other file?
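For reference, a minimal sketch of how that setting could look in a Python script. It assumes the variable has to be set before TensorFlow is imported, and the helper name `set_gpu_thread_mode` is hypothetical; where it would best fit in OTBTF is left open. For the command-line applications, exporting the same variable in the shell or Docker environment would presumably have the same effect.

```python
import os


def set_gpu_thread_mode(mode="gpu_private"):
    """Set TF_GPU_THREAD_MODE for the current process and return the value set."""
    os.environ["TF_GPU_THREAD_MODE"] = mode
    return os.environ["TF_GPU_THREAD_MODE"]


# Call this before TensorFlow is imported, so the setting is already in
# place when the GPU devices are initialized.
thread_mode = set_gpu_thread_mode()

# import tensorflow as tf  # would follow here (assumes TensorFlow is installed)
```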

I still haven't dived into the TF python yet, but I will likely do so later. For now, I look forward to the OTBTF-3.0!

Thanks a lot,
Ferdinand Klingenberg

@vidlb
Contributor

vidlb commented Sep 8, 2021

Hi @FerdinandKlingenberg, maybe you should try with TF 2.4.3 and CUDA 11.0.3 (with TF_CUDA_COMPUTE_CAPABILITIES=7.0)?
Sometimes the latest version is buggy... And you only need CUDA 11.2 / TF 2.5 if you are using an RTX3000 or anything with COMPUTE_CAPABILITY=8.6.

But it's also possible that the low usage is due to your shared GPU setup, in which case it would require some additional configuration on the server side...

Good luck !

@FerdinandKlingenberg
Author

Hi @vidlb, a somewhat late follow-up. I tried your suggestion earlier, unfortunately with no change. After some more testing with other models, I think it is instead related to the patch size/grid size I am using.

@remicres, my part of the problem is solved. I will let you decide if you want to close this issue, given the headline is still valid.

Thank you both a lot,
Ferdinand Klingenberg

@remicres remicres added the enhancement New feature or request label Feb 10, 2022

3 participants