flux_train_replicate: 100%|█████████▉| 2999/3000 [1:18:19<00:01, 1.57s/it, lr: 4.0e-04 loss: 2.634e-01]
Generating Images: 0%| | 0/3 [00:00<?, ?it/s]
Generating Images: 33%|███▎ | 1/3 [00:12<00:24, 12.45s/it]
Generating Images: 67%|██████▋ | 2/3 [00:24<00:12, 12.44s/it]
Generating Images: 100%|██████████| 3/3 [00:37<00:00, 12.44s/it]
Saved to output/flux_train_replicate/optimizer.pt
Saving weights to W&B: flux_train_replicate.safetensors
wandb: 4926.841 MB of 4926.841 MB uploaded
Thread SenderThread:
Traceback (most recent call last):
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 48, in run
self._run()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal_util.py", line 99, in _run
self._process(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/internal.py", line 327, in _process
self._sm.send(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 398, in send
send_handler(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 420, in send_request
send_handler(record)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/sdk/internal/sender.py", line 658, in send_request_defer
self._dir_watcher.finish()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 403, in finish
self._get_file_event_handler(file_path, save_name).finish()
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 181, in finish
self.on_modified(force=True)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 167, in on_modified
if self.current_size == 0:
File "/root/.pyenv/versions/3.10.15/lib/python3.10/site-packages/wandb/filesync/dir_watcher.py", line 133, in current_size
return os.path.getsize(self.file_path)
File "/root/.pyenv/versions/3.10.15/lib/python3.10/genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/src/wandb/run-20241205_114337-6n94u01o/files/output/flux_train_replicate/flux_train_replicate_000000100.safetensors'
wandb: ERROR Internal wandb error: file data was not synced
wandb: 4926.841 MB of 4926.841 MB uploaded
wandb: ERROR Problem finishing run
output/flux_train_replicate/
output/flux_train_replicate/config.yaml
output/flux_train_replicate/lora.safetensors
output/flux_train_replicate/captions/
I see that on every checkpoint, we remove the previous LoRA after sending the current one:
flux_train_replicate: 97%|█████████▋| 2899/3000 [1:15:46<02:39, 1.58s/it, lr: 4.0e-04 loss: 3.780e-01]
Removing old save: output/flux_train_replicate/flux_train_replicate_000002800.safetensors
Saving weights to W&B: flux_train_replicate_000002900.safetensors
But then, on finish, wandb tries to send all of them again.
I think we should either configure wandb not to re-send all the LoRAs on finish, or not send them during training at all and only send them on finish.
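For reference, here is a minimal sketch of the second option, assuming the training script currently mirrors every checkpoint into the run's files directory via wandb.save() (which is what makes the dir watcher track the since-deleted checkpoints). The project name and file path below are taken from the log above; everything else is illustrative and not the actual ai-toolkit integration:

import wandb

# Hypothetical setup: reuse the run the trainer already created, or create one here.
run = wandb.init(project="flux_train_replicate")

# ... training loop: keep intermediate checkpoints local only, do not wandb.save() them ...

# Option A: queue only the final weights for upload when the run finishes.
# policy="end" tells wandb to sync the file once, at the end of the run.
wandb.save("output/flux_train_replicate/lora.safetensors", policy="end")

# Option B: log the final weights as a versioned model artifact instead of a run file.
artifact = wandb.Artifact("flux_train_replicate-lora", type="model")
artifact.add_file("output/flux_train_replicate/lora.safetensors")
run.log_artifact(artifact)

run.finish()

Either way, nothing that gets deleted during training is ever registered with wandb's file watcher, so the FileNotFoundError at finish should not occur.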
I consistently get this error.