Allow/test usage in PBS/Slurm #144
Comments
I can test it, but I need to have access to a SLURM-based resource.
I believe I've almost exclusively run terratorch through SLURM. @romeokienzler, can you elaborate on this issue?
@Foxigod Have you used multiple GPUs?
@biancazadrozny Ahh, yes I have, but I did need to modify my submission script.
I just experimented with 2 nodes, and it seems to have worked. I have 4 GPUs per node, and this was amongst the printouts (printout not preserved).
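For a run like the 2-node, 4-GPUs-per-node experiment above, a quick sanity check is to print the rank layout each task sees. The sketch below reads the per-task environment variables that Slurm sets (`SLURM_NTASKS`, `SLURM_NNODES`, `SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NODEID`). It assumes one task per GPU launched via `srun`, which the thread does not confirm; the helper name `slurm_topology` is hypothetical and not part of terratorch.

```python
import os


def slurm_topology(env=None):
    """Return the rank layout Slurm exposes to the current task.

    Assumption: one task per GPU, started with srun. Falls back to a
    single-process layout when the variables are absent.
    """
    env = os.environ if env is None else env
    return {
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total tasks in the job step
        "num_nodes": int(env.get("SLURM_NNODES", 1)),    # nodes allocated to the job
        "global_rank": int(env.get("SLURM_PROCID", 0)),  # task index across all nodes
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # task index within its node
        "node_rank": int(env.get("SLURM_NODEID", 0)),    # index of the node itself
    }


if __name__ == "__main__":
    print(slurm_topology())
```

With 2 nodes and 4 tasks per node, every task should report a world size of 8, and the last task on the second node should report global rank 7, local rank 3, node rank 1. If all tasks report a world size of 1, the script was probably started without `srun`.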
@takaomoriyama, can you please verify and repeat the scale-out and scale-up tests you did on ccc?
@romeokienzler Sure!
Samy from FZJ managed to run TT on JUWELS using Slurm.
@takaomoriyama has access to JUWELS now, implementing...
Related to #146.
Created a batch script for Slurm #234.
So far, scaling up to 4 GPUs works, but there are two issues: an intermittent network error and a "no training batch" error.
@MinasMayth is helping to get a contact at FZJ to re-run with more data...
@takaomoriyama to re-run (because of outage).
Here are the reasons for the errors in the table above.
[No data error]
These errors occurred because there were not enough batches available for all the nodes.
[TCPStore() error]
We have not found anyone else who has been running terratorch at JSC except for @Foxigod. We are unsure exactly how the nodes communicate, i.e. whether or not it relies on one node being recognized as a sort of "root node". If the underlying method doesn't account for the InfiniBand islands, those could explain the TCPStore() error (suggested by Eli).
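One way to narrow down the root-node question is to check, from every node, whether the rendezvous address is reachable at all over TCP. The sketch below only uses the Python standard library. It assumes the batch script exports `MASTER_ADDR` and `MASTER_PORT` (the variables read by `torch.distributed` with `env://` initialization); whether the setup on JSC uses them is not confirmed in this thread, and the helper name `rendezvous_reachable` is hypothetical.

```python
import os
import socket


def rendezvous_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: both variables are exported by the submission script.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    print(f"{socket.gethostname()} -> {host}:{port}", rendezvous_reachable(host, port))
```

Running this with `srun` on all nodes while the rank-0 process is listening would show whether only nodes in a different island fail to connect. A failure here does not prove the islands are the cause, but a success on every node would make that explanation less likely.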
Results for the sen1floods11_vit workload with batch_size 1 (
@takaomoriyama to test with a larger dataset other than sen1floods11_vit. Please ask @blumenstiel or @paolofraccaro; they most probably have other datasets, e.g. Major TOM or similar.
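The "no data" errors mentioned earlier in the thread come down to simple arithmetic: as the number of GPUs grows, each rank's share of a small dataset can shrink below one full batch. The sketch below is a simplified model that assumes the samples are split evenly across ranks with the remainder dropped. It is not the actual sharding logic of terratorch or its data loaders, and the sample counts in the usage note are made up for illustration.

```python
def batches_per_rank(num_samples, batch_size, world_size, drop_last=True):
    """Estimate how many batches each rank sees in one epoch.

    Simplifying assumption: samples are divided evenly across ranks and
    the remainder is discarded.
    """
    if num_samples < 0 or batch_size <= 0 or world_size <= 0:
        raise ValueError("num_samples must be >= 0; batch_size and world_size must be > 0")
    per_rank = num_samples // world_size
    if drop_last:
        return per_rank // batch_size      # incomplete final batch is dropped
    return -(-per_rank // batch_size)      # ceiling division keeps the partial batch


def max_world_size(num_samples, batch_size):
    """Largest rank count that still leaves every rank one full batch."""
    return num_samples // batch_size
```

For a hypothetical split of 100 samples with a batch size of 16, `batches_per_rank(100, 16, 4)` gives 1, while `batches_per_rank(100, 16, 8)` gives 0, so an 8-GPU run would have nothing to train on. Under the same assumption, `max_world_size(100, 16)` is 6, which is one reason a larger dataset is needed for the scale-out tests.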
Is your feature request related to a problem? Please describe.
TT not usable via SLURM/PBS
Describe the solution you'd like
Allow/test usage in PBS/Slurm
reported by @biancazadrozny