Wandb final send fails #60

Open
alexeevit opened this issue Dec 5, 2024 · 0 comments
alexeevit commented Dec 5, 2024

I consistently get this error at the end of training:

flux_train_replicate: 100%|█████████▉| 2999/3000 [1:18:19<00:01,  1.57s/it, lr: 4.0e-04 loss: 2.634e-01]
Generating Images:   0%|          | 0/3 [00:00<?, ?it/s]
Generating Images:  33%|███▎      | 1/3 [00:12<00:24, 12.45s/it]
Generating Images:  67%|██████▋   | 2/3 [00:24<00:12, 12.44s/it]
Generating Images: 100%|██████████| 3/3 [00:37<00:00, 12.44s/it]
Saved to output/flux_train_replicate/optimizer.pt
Saving weights to W&B: flux_train_replicate.safetensors
wandb: - 4926.841 MB of 4926.841 MB uploaded
wandb: \ 4926.841 MB of 4926.841 MB uploaded
Thread SenderThread:
Traceback (most recent call last):
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 48, in run
self._run()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 99, in _run
self._process(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 327, in _process
self._sm.send(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 398, in send
send_handler(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 420, in send_request
send_handler(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 658, in send_request_defer
self._dir_watcher.finish()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 403, in finish
self._get_file_event_handler(file_path, save_name).finish()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 181, in finish
self.on_modified(force=True)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 167, in on_modified
if self.current_size == 0:
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 133, in current_size
return os.path.getsize(self.file_path)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/src/wandb/run-20241205_114337-6n94u01o/files/output/flux_train_replicate/flux_train_replicate_000000100.safetensors'
wandb: ERROR Internal wandb error: file data was not synced
wandb: | 4926.841 MB of 4926.841 MB uploaded
wandb: / 4926.841 MB of 4926.841 MB uploaded
wandb: - 4926.841 MB of 4926.841 MB uploaded
wandb: \ 4926.841 MB of 4926.841 MB uploaded
wandb: | 4926.841 MB of 4926.841 MB uploaded
wandb: / 4926.841 MB of 4926.841 MB uploaded
wandb: - 4926.841 MB of 4926.841 MB uploaded
wandb: \ 4926.841 MB of 4926.841 MB uploaded
wandb: ERROR Problem finishing run
output/flux_train_replicate/
output/flux_train_replicate/config.yaml
output/flux_train_replicate/lora.safetensors
output/flux_train_replicate/captions/

I see that on every checkpoint, we remove the previous LoRA after sending the current one:

flux_train_replicate:  97%|█████████▋| 2899/3000 [1:15:46<02:39,  1.58s/it, lr: 4.0e-04 loss: 3.780e-01]
                                                                                                        
Removing old save: output/flux_train_replicate/flux_train_replicate_000002800.safetensors
Saving weights to W&B: flux_train_replicate_000002900.safetensors

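For illustration, here is a minimal sketch of the failure mode I suspect (this is not the toolkit's actual code, and the project name/paths are placeholders): a file registered with `wandb.save()` under the default `policy="live"` stays tracked in the run's `files/` directory until the run finishes, so deleting the original afterwards leaves wandb trying to stat a file that no longer exists.

```python
import os
import wandb

run = wandb.init(project="flux_train_replicate")  # placeholder project name

ckpt = "output/flux_train_replicate/flux_train_replicate_000000100.safetensors"
os.makedirs(os.path.dirname(ckpt), exist_ok=True)
with open(ckpt, "wb") as f:        # stands in for a real intermediate checkpoint
    f.write(b"\x00" * 1024)

# Default policy="live": the file is linked into the run's files/ directory
# and stays tracked until the run finishes.
wandb.save(ckpt, policy="live")

# Later the old checkpoint is removed to free disk space ("Removing old save").
os.remove(ckpt)

# On finish, wandb's DirWatcher still stats the tracked file and hits
# FileNotFoundError, which is what the traceback above shows.
run.finish()
```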
But then, on finish, wandb tries to send all of them again, including the checkpoints that have already been deleted, which is presumably what triggers the FileNotFoundError above.
I think we should either configure wandb not to re-send all the LoRAs on finish, or not send them during training at all and only send them once on finish.
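
A rough sketch of what I mean, assuming the weights are pushed through `wandb.save()` (placeholder paths; I haven't verified which call the toolkit actually uses):

```python
import wandb

run = wandb.init(project="flux_train_replicate")  # placeholder project name

# Option A: skip intermediate checkpoints entirely and only register the
# final LoRA once training is done; policy="now" uploads it immediately.
wandb.save("output/flux_train_replicate/lora.safetensors", policy="now")

# Option B: register the checkpoints with policy="end", which defers the
# upload until the run finishes instead of live-syncing every save.
wandb.save("output/flux_train_replicate/*.safetensors", policy="end")

run.finish()
```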
