Allow/test usage in PBS/Slurm #144

Open
romeokienzler opened this issue Sep 4, 2024 · 17 comments

@romeokienzler
Collaborator

Is your feature request related to a problem? Please describe.
TT (terratorch) is not usable via SLURM/PBS.

Describe the solution you'd like
Allow/test usage in PBS/Slurm

reported by @biancazadrozny

@Joao-L-S-Almeida
Member

I can test it, but I need to have access to a SLURM-based resource.

@Foxigod
Contributor

Foxigod commented Oct 8, 2024

I believe I've almost exclusively run terratorch through SLURM. Can you @romeokienzler elaborate on this issue?

@biancazadrozny
Collaborator

@Foxigod Have you used multiple GPUs?

@Foxigod
Contributor

Foxigod commented Oct 8, 2024

@biancazadrozny Ahh, yes I have, but I did need to modify my submission script.
In my .yaml config I kept the trainer.devices: auto as the examples did, but I needed to explicitly export CUDA_VISIBLE_DEVICES=0,1,2,3 before calling terratorch fit ... to run on the 4 GPUs I have per node (or at least that seemed to work, irrespective of its necessity).
However, I never tried multi-node experiments with terratorch, so I can't speak to that part.
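
For illustration, the relevant part of such a submission script might look like the sketch below (config.yaml is a placeholder, and whether the export is strictly required was not verified):

# Single-node run: make all 4 GPUs visible before calling terratorch,
# while trainer.devices stays "auto" in the YAML config.
export CUDA_VISIBLE_DEVICES=0,1,2,3
terratorch fit --config config.yaml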

@Foxigod
Contributor

Foxigod commented Oct 16, 2024

I just experimented with 2 nodes, and it seems to have worked. I have 4 GPUs per node, and this was among the printouts:

0: ----------------------------------------------------------------------------------------------------
0: distributed_backend=nccl
0: All distributed processes registered. Starting with 8 processes
0: ----------------------------------------------------------------------------------------------------
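
For context, the usual PyTorch Lightning pattern for such a two-node run is one SLURM task per GPU; a sketch of a submission script along those lines (the directives and config path are illustrative, not the exact script used here):

#!/bin/bash
#SBATCH --nodes=2                 # 2 nodes ...
#SBATCH --ntasks-per-node=4       # ... with one task per GPU
#SBATCH --gres=gpu:4

# srun launches 8 tasks in total; Lightning picks up the SLURM environment
# and initialises DDP over NCCL ("Starting with 8 processes" above).
# Depending on the Lightning version, trainer.num_nodes may also need to be
# set to 2 in the YAML config.
srun terratorch fit --config config.yaml   # config.yaml is a placeholder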

@romeokienzler
Collaborator Author

@takaomoriyama can you please verify and repeat the scale-out and scale-up tests you did on CCC?

@takaomoriyama
Member

@romeokienzler Sure!

@romeokienzler
Collaborator Author

Samy from FZJ managed to run TT on JUWELS using Slurm.

@romeokienzler
Collaborator Author

@takaomoriyama has access to JUWELS now, implementing...

@romeokienzler
Collaborator Author

related to #146

@takaomoriyama
Member

takaomoriyama commented Nov 14, 2024

Created a batch script for Slurm in #234.
Current results for the sen1floods11_vit workload:

<num_nodes> x <num_gpus> - Execution time / error
------------------------------------------------------
1x1 - 175m32.861s
1x2 - 87m59.467s
1x4 - 46m28.186s
2x1 - Network error: TCPStore() (1/2 clients has joined)
2x2 - 48m45.038s
4x4 - Network error: TCPStore() (4/16 clients has joined)
8x4 - Error: No training batch
16x4 - Error: No training batch
32x4 - Network error: TCPStore() (80/128 clients has joined)

So far, scaling up to 4 GPUs is OK, but beyond that we are suffering from two issues: an intermittent network error (TCPStore()) and a "no training batch" error.
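
For the intermittent TCPStore() failures, one pattern sometimes used on SLURM is to pin the rendezvous endpoint explicitly in the batch script; this is only a sketch and has not been verified to fix the errors above:

# Point every rank at the first node of the allocation for rendezvous
# (illustrative; not verified against the failures in the table above).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500   # any free port

srun terratorch fit --config config.yaml   # config.yaml is a placeholder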

@romeokienzler
Collaborator Author

@MinasMayth is helping to get a contact at FZJ to re-run with more data...

@romeokienzler
Collaborator Author

@takaomoriyama to re-run (because of an outage)

@takaomoriyama
Member

Here are the reasons for the errors in the table above.

[No data error]

8x4 - Error: No training batch
16x4 - Error: No training batch

These errors occurred because not enough batches were available for all tasks.
Sen1floods11 contains 252 files for training, and the default batch_size is 16, so we have ceil(252 / 16) = 16 batches. If we run with more than 16 tasks, some tasks receive no training batch at all. I adjusted the batch size so that every task receives at least one batch (see the sketch below); the results are shown in the next comment.
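
As a sketch of that constraint (the helper logic below is illustrative, not part of the batch script in #234), the batch size has to be small enough that ceil(252 / batch_size) is at least the number of tasks:

# Pick a batch_size that leaves at least one batch per task
# (252 = Sen1Floods11 training files; all values are illustrative).
TRAIN_FILES=252
NTASKS=$(( SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE ))   # e.g. 8 x 4 = 32
BATCH_SIZE=$(( TRAIN_FILES / NTASKS ))                      # e.g. 252 / 32 = 7
[ "$BATCH_SIZE" -lt 1 ] && BATCH_SIZE=1

srun terratorch fit --config config.yaml \
    --data.init_args.batch_size "$BATCH_SIZE"   # config.yaml is a placeholder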

[TCPStore() error]
This still occurs intermittently.

@MinasMayth
Collaborator

I have not found anyone else who has been running terratorch at JSC except for @Foxigod. We are unsure how exactly the nodes communicate, i.e. whether it is based on some node being recognized as a sort of "root node" or not. If the underlying method used here doesn't account for the InfiniBand islands, then those could explain the TCPStore() error (suggested by Eli).

@takaomoriyama
Member

Results for the sen1floods11_vit workload with batch_size 1 (terratorch --data.init_args.batch_size 1), run on the JUWELS cluster at FZJ:

<num_nodes>x<num_gpus>  #batches/task  Execution time  Speedup
----------------------------------------------------------------
                   1x1            252      87m17.542s    1.00x
                   1x2            126      50m49.549s    1.72x
                   1x4             63      29m34.590s    2.95x
                   2x1            126      59m43.249s    1.46x
                   2x2             63      30m38.232s    2.85x
                   2x4             32      19m43.904s    4.42x
                   4x1             63      35m14.097s    2.48x
                   4x4             16      13m25.957s    6.50x
                   8x4              8      11m11.514s    7.80x
                  16x4              4       9m27.094s    9.24x
                  32x4              2       9m53.797s    8.82x

@romeokienzler
Collaborator Author

@takaomoriyama to test with a larger dataset other than sen1floods11_vit.

Please ask @blumenstiel or @paolofraccaro - they most probably have other datasets, e.g., Major TOM or similar.
