Allow/test usage in PBS/Slurm #144

Open
romeokienzler opened this issue Sep 4, 2024 · 17 comments

@romeokienzler
Collaborator

Is your feature request related to a problem? Please describe.
TT (terratorch) is not usable via SLURM/PBS.

Describe the solution you'd like
Allow/test usage in PBS/Slurm

reported by @biancazadrozny

@Joao-L-S-Almeida
Member

I can test it, but I need to have access to a SLURM-based resource.

@Foxigod
Contributor

Foxigod commented Oct 8, 2024

I believe I've almost exclusively run terratorch through SLURM. Can you @romeokienzler elaborate on this issue?

@biancazadrozny
Collaborator

@Foxigod Have you used multiple GPUs?

@Foxigod
Contributor

Foxigod commented Oct 8, 2024

@biancazadrozny Ahh, yes I have, but I did need to modify my submission script.
In my .yaml config I kept the trainer.devices: auto as the examples did, but I needed to explicitly export CUDA_VISIBLE_DEVICES=0,1,2,3 before calling terratorch fit ... to run on the 4 GPUs I have per node (or at least that seemed to work, irrespective of its necessity).
However, I never tried multi-node experiments with terratorch, so I can't speak to that part.
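
For illustration, the relevant part of such a submission script might look like the sketch below (config.yaml is a placeholder, and whether the export is strictly required was not verified):

# Single-node run: make all 4 GPUs visible before calling terratorch,
# while trainer.devices stays "auto" in the YAML config.
export CUDA_VISIBLE_DEVICES=0,1,2,3
terratorch fit --config config.yaml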

@Foxigod
Contributor

Foxigod commented Oct 16, 2024

I just experimented with 2 nodes, and it seems to have worked. I have 4 GPUs per node, and this was among the printouts:

0: ----------------------------------------------------------------------------------------------------
0: distributed_backend=nccl
0: All distributed processes registered. Starting with 8 processes
0: ----------------------------------------------------------------------------------------------------
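
For context, the usual PyTorch Lightning pattern for such a two-node run is one SLURM task per GPU; a sketch of a submission script along those lines (the directives and config path are illustrative, not the exact script used here):

#!/bin/bash
#SBATCH --nodes=2                 # 2 nodes ...
#SBATCH --ntasks-per-node=4       # ... with one task per GPU
#SBATCH --gres=gpu:4

# srun launches 8 tasks in total; Lightning picks up the SLURM environment
# and initialises DDP over NCCL ("Starting with 8 processes" above).
# Depending on the Lightning version, trainer.num_nodes may also need to be
# set to 2 in the YAML config.
srun terratorch fit --config config.yaml   # config.yaml is a placeholder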

@romeokienzler
Collaborator Author

@takaomoriyama can you please verify and repeat the scale-out and scale-up tests you did on CCC?

@takaomoriyama
Member

@romeokienzler Sure!

@romeokienzler
Collaborator Author

Samy from FZJ managed to run TT on JUWELS using Slurm.

@romeokienzler
Collaborator Author

@takaomoriyama has access to JUWELS now, implementing...

@romeokienzler
Collaborator Author

related to #146

@takaomoriyama
Member

takaomoriyama commented Nov 14, 2024

Created a batch script for Slurm in #234.
Current results for the sen1floods11_vit workload:

<num_nodes> x <num_gpus> - Execution time / error
------------------------------------------------------
1x1 - 175m32.861s
1x2 - 87m59.467s
1x4 - 46m28.186s
2x1 - Network error: TCPStore() (1/2 clients has joined)
2x2 - 48m45.038s
4x4 - Network error: TCPStore() (4/16 clients has joined)
8x4 - Error: No training batch
16x4 - Error: No training batch
32x4 - Network error: TCPStore() (80/128 clients has joined)

So far, scaling up to 4 GPUs is OK, but beyond that we are suffering from two issues: an intermittent network error (TCPStore()) and a "no training batch" error.
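
For the intermittent TCPStore() failures, one pattern sometimes used on SLURM is to pin the rendezvous endpoint explicitly in the batch script; this is only a sketch and has not been verified to fix the errors above:

# Point every rank at the first node of the allocation for rendezvous
# (illustrative; not verified against the failures in the table above).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500   # any free port

srun terratorch fit --config config.yaml   # config.yaml is a placeholder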

@romeokienzler
Collaborator Author

@MinasMayth is helping to get a contact at FZJ to re-run with more data...

@romeokienzler
Collaborator Author

@takaomoriyama to re-run (because of an outage)

@takaomoriyama
Member

Here are the reasons for the errors in the table above.

[No data error]

8x4 - Error: No training batch
16x4 - Error: No training batch

These errors occurred because not enough batches were available for all tasks.
Sen1floods11 contains 252 files for training, and the default batch_size is 16, so we have ceil(252 / 16) = 16 batches. If we run with more than 16 tasks, some tasks receive no training batch at all. I adjusted the batch size so that every task receives at least one batch (see the sketch below); the results are shown in the next comment.
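
As a sketch of that constraint (the helper logic below is illustrative, not part of the batch script in #234), the batch size has to be small enough that ceil(252 / batch_size) is at least the number of tasks:

# Pick a batch_size that leaves at least one batch per task
# (252 = Sen1Floods11 training files; all values are illustrative).
TRAIN_FILES=252
NTASKS=$(( SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE ))   # e.g. 8 x 4 = 32
BATCH_SIZE=$(( TRAIN_FILES / NTASKS ))                      # e.g. 252 / 32 = 7
[ "$BATCH_SIZE" -lt 1 ] && BATCH_SIZE=1

srun terratorch fit --config config.yaml \
    --data.init_args.batch_size "$BATCH_SIZE"   # config.yaml is a placeholder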

[TCPStore() error]
This still occurs intermittently.

@MinasMayth
Collaborator

I have not found anyone else who has been running terratorch at JSC except for @Foxigod. We are unsure how exactly the nodes communicate, i.e. whether it is based on some node being recognized as a sort of "root node" or not. If the underlying method used here doesn't account for the InfiniBand islands, then those could explain the TCPStore() error (suggested by Eli).

@takaomoriyama
Member

Results for the sen1floods11_vit workload with batch_size 1 (terratorch --data.init_args.batch_size 1), run on the JUWELS cluster at FZJ:

<num_nodes>x<num_gpus>  #batches/task  Execution time  Speedup
----------------------------------------------------------------
                   1x1            252      87m17.542s    1.00x
                   1x2            126      50m49.549s    1.72x
                   1x4             63      29m34.590s    2.95x
                   2x1            126      59m43.249s    1.46x
                   2x2             63      30m38.232s    2.85x
                   2x4             32      19m43.904s    4.42x
                   4x1             63      35m14.097s    2.48x
                   4x4             16      13m25.957s    6.50x
                   8x4              8      11m11.514s    7.80x
                  16x4              4       9m27.094s    9.24x
                  32x4              2       9m53.797s    8.82x

@romeokienzler
Collaborator Author

@takaomoriyama to test with a larger dataset other than sen1floods11_vit.

Please ask @blumenstiel or @paolofraccaro - they most probably have other datasets, e.g., Major TOM or similar.
