Wandb final send fails #60

Open
alexeevit opened this issue Dec 5, 2024 · 0 comments
alexeevit commented Dec 5, 2024

I consistently get this error at the end of training:

flux_train_replicate: 100%|█████████▉| 2999/3000 [1:18:19<00:01,  1.57s/it, lr: 4.0e-04 loss: 2.634e-01]
Generating Images:   0%|          | 0/3 [00:00<?, ?it/s]
Generating Images:  33%|███▎      | 1/3 [00:12<00:24, 12.45s/it]
Generating Images:  67%|██████▋   | 2/3 [00:24<00:12, 12.44s/it]
Generating Images: 100%|██████████| 3/3 [00:37<00:00, 12.44s/it]
Saved to output/flux_train_replicate/optimizer.pt
Saving weights to W&B: flux_train_replicate.safetensors
wandb: - 4926.841 MB of 4926.841 MB uploaded
wandb: \ 4926.841 MB of 4926.841 MB uploaded
Thread SenderThread:
Traceback (most recent call last):
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 48, in run
self._run()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 99, in _run
self._process(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 327, in _process
self._sm.send(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 398, in send
send_handler(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 420, in send_request
send_handler(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 658, in send_request_defer
self._dir_watcher.finish()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 403, in finish
self._get_file_event_handler(file_path, save_name).finish()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 181, in finish
self.on_modified(force=True)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 167, in on_modified
if self.current_size == 0:
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 133, in current_size
return os.path.getsize(self.file_path)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/src/wandb/run-20241205_114337-6n94u01o/files/output/flux_train_replicate/flux_train_replicate_000000100.safetensors'
wandb: ERROR Internal wandb error: file data was not synced
wandb: | 4926.841 MB of 4926.841 MB uploaded
wandb: / 4926.841 MB of 4926.841 MB uploaded
wandb: - 4926.841 MB of 4926.841 MB uploaded
wandb: \ 4926.841 MB of 4926.841 MB uploaded
wandb: | 4926.841 MB of 4926.841 MB uploaded
wandb: / 4926.841 MB of 4926.841 MB uploaded
wandb: - 4926.841 MB of 4926.841 MB uploaded
wandb: \ 4926.841 MB of 4926.841 MB uploaded
wandb: ERROR Problem finishing run
output/flux_train_replicate/
output/flux_train_replicate/config.yaml
output/flux_train_replicate/lora.safetensors
output/flux_train_replicate/captions/

I see that on every checkpoint, we remove the previous LoRA after sending the current one:

flux_train_replicate:  97%|█████████▋| 2899/3000 [1:15:46<02:39,  1.58s/it, lr: 4.0e-04 loss: 3.780e-01]
                                                                                                        
Removing old save: output/flux_train_replicate/flux_train_replicate_000002800.safetensors
Saving weights to W&B: flux_train_replicate_000002900.safetensors

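For illustration, here is a minimal sketch of the failure mode I suspect (this is not the toolkit's actual code, and the project name/paths are placeholders): a file registered with `wandb.save()` under the default `policy="live"` stays tracked in the run's `files/` directory until the run finishes, so deleting the original afterwards leaves wandb trying to stat a file that no longer exists.

```python
import os
import wandb

run = wandb.init(project="flux_train_replicate")  # placeholder project name

ckpt = "output/flux_train_replicate/flux_train_replicate_000000100.safetensors"
os.makedirs(os.path.dirname(ckpt), exist_ok=True)
with open(ckpt, "wb") as f:        # stands in for a real intermediate checkpoint
    f.write(b"\x00" * 1024)

# Default policy="live": the file is linked into the run's files/ directory
# and stays tracked until the run finishes.
wandb.save(ckpt, policy="live")

# Later the old checkpoint is removed to free disk space ("Removing old save").
os.remove(ckpt)

# On finish, wandb's DirWatcher still stats the tracked file and hits
# FileNotFoundError, which is what the traceback above shows.
run.finish()
```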
But then, on finish, wandb tries to send all of them again, including the checkpoints that have already been deleted, which is presumably what triggers the FileNotFoundError above.
I think we should either configure wandb not to re-send all the LoRAs on finish, or not send them during training at all and only send them once on finish.
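
A rough sketch of what I mean, assuming the weights are pushed through `wandb.save()` (placeholder paths; I haven't verified which call the toolkit actually uses):

```python
import wandb

run = wandb.init(project="flux_train_replicate")  # placeholder project name

# Option A: skip intermediate checkpoints entirely and only register the
# final LoRA once training is done; policy="now" uploads it immediately.
wandb.save("output/flux_train_replicate/lora.safetensors", policy="now")

# Option B: register the checkpoints with policy="end", which defers the
# upload until the run finishes instead of live-syncing every save.
wandb.save("output/flux_train_replicate/*.safetensors", policy="end")

run.finish()
```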
