W&B Network Usage Bug #5071
@glenn-jocher Thanks. This definitely seems like a problem. Can you confirm that there were only 4 runs? The workspace has a total of 34 runs; probably all of them are previous runs. Can you link the 4 runs that you think are responsible for this network traffic?
@AyushExel Yes, this was only on the n1-n4 runs:
But it's possible the bug affects all the runs. These 4 are the only ones we trained on GCP with Docker (which we pay for ourselves); the others are on AWS instances with Docker, where we are still using our credits and haven't checked the bills for egress breakdowns yet. My suspicion is that the terminal output keeps growing, due to the tqdm issue with Docker plus the repeated 'Network error resolved' messages, and that wandb keeps re-uploading the full terminal output. This seems likely since the traffic (MB/s) increases linearly from the start of training until the end, when it drops to zero. I checked the runs and they have only 1 artifact, best.pt uploaded on training completion, so the traffic is not from checkpoint uploads.
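If console capture is indeed the culprit, one way to test the theory would be to disable console-log capture for a run and see whether the egress growth disappears. A minimal sketch, assuming wandb's documented WANDB_CONSOLE setting (not something the wandb team has confirmed as the cause here):

```python
# Sketch: turn off W&B console-log capture so the ever-growing terminal output
# (tqdm lines and the repeated "Network error resolved" messages) is not
# streamed to the wandb backend. Metrics, config, and artifacts still upload.
import os

os.environ["WANDB_CONSOLE"] = "off"            # must be set before wandb.init()

import wandb

run = wandb.init(
    project="yolov5_v6",                       # project name from this thread
    settings=wandb.Settings(console="off"),    # equivalent programmatic option
)
```

If network egress stays flat with console capture disabled, that would strongly point at the log-upload path.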
@AyushExel Also strange: my W&B storage usage shows 68 GB across all projects, and only 6 GB for this project (with the 34 runs), yet GCP says network egress from this instance was 1,700 GB just for these 4 models.
@glenn-jocher Thanks for the detailed info. I have forwarded this to the client engineers and will keep you updated.
@glenn-jocher Can you provide a zip of the run dir or the .wandb file for the run? It would help us find the cause of this.
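For reference, the requested files are written locally by the wandb client. A minimal sketch for locating them, assuming wandb's default layout of a wandb/ directory next to the training script:

```python
# Sketch: list the local run directories and their .wandb files. Each run dir
# (wandb/run-<timestamp>-<id>/) also contains logs/debug.log and
# logs/debug-internal.log, which record what the client attempted to upload.
from pathlib import Path

for run_dir in sorted(Path("wandb").glob("run-*")):
    print(run_dir, list(run_dir.glob("*.wandb")))
```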
@AyushExel Ah, unfortunately I can't; the instance has since been deleted. This should be reproducible, but unfortunately we'd have to pay to reproduce it. The steps are just:
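(The exact commands are not preserved above. As a purely hypothetical reconstruction from the details in this thread, i.e. Docker on a cloud instance, small YOLOv3-tiny models, and W&B logging enabled, the setup would look roughly like the sketch below; the image tag, dataset, and arguments are assumptions.)

```python
# Hypothetical reconstruction of the reproduction setup (image tag, dataset,
# and arguments are assumptions, not the exact commands used for n1-n4):
# train a small model inside the ultralytics Docker image with W&B logging
# enabled, then watch the instance's network egress over the training period.
import subprocess

subprocess.run(
    [
        "docker", "run", "--gpus", "all", "--ipc=host",
        "-e", "WANDB_API_KEY=<your-key>",      # enables W&B logging in-container
        "ultralytics/yolov5:latest",           # assumed image tag
        "python", "train.py",
        "--weights", "yolov3-tiny.pt",         # small model, as in the report
        "--data", "coco.yaml",                 # assumed dataset
        "--epochs", "300",                     # assumed duration (~3 days)
    ],
    check=True,
)
```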
@glenn-jocher OK, thanks. I'll reproduce this and get back to you.
@AyushExel thanks! I'm really curious what happens. |
@AyushExel It's been a month since I opened this. This is a major bug that cost us $200 in network egress just for training 4 very small models (YOLOv3-tiny). I think you should raise this directly on the wandb repo and bring attention to it. The run is publicly available here:
@AyushExel To clarify, the amount of data wandb egressed off our instance is unbelievable: >1,500 GB for 4 small models. Before any deeper integration can be considered, serious issues like this and others like #2685 (comment) clearly need to be resolved.
@AyushExel This is happening again right now. As I'm typing this, AWS egress expenses are eating into our cloud credits because of this ongoing W&B network egress bug, which has been open for over a month with no progress. W&B logging is literally becoming unusable due to this issue.
@AyushExel I've raised this directly in wandb/wandb#2905 due to lack of progress here. |
@glenn-jocher This was being worked on. We take these issues very seriously, and this was reported to the team at once. Some issues that are not always reproducible take more time to fix, which is why I requested the log files. @vwrj has responded with details on the fix in the client issue that you opened; it will go out with the next release on Tuesday. He'd like to get on a call with you to understand exactly how you run these scripts. Can you please coordinate to set up some time for that, @glenn-jocher?
@glenn-jocher For the bugs that aren't easily reproducible, let's coordinate over our joint Slack channel, where I can pull in engineers from other teams so you don't have to wait for the complete release cycle. Thanks!
@AyushExel The network usage bug appears resolved now based on yesterday's tests. Closing the issue!
@AyushExel I've got some odd behavior here by W&B. I trained 4 (very small) models, only 4 MB each, over 3 days on a GCP instance, and saw enormous network egress costs related to 1,400 GB of data sent from the instance. I'm assuming nearly all of this is W&B traffic, as there were no other processes running on the instance. The egress to W&B steadily rose over the training period, reaching about 8 MB/s by the end, nonstop 24/7, and dropped back to zero after training completed.
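As a rough back-of-the-envelope check, a rate that ramps roughly linearly from 0 to 8 MB/s over 3 days averages about 4 MB/s, which lands in the same ballpark as the billed egress:

```python
# Rough sanity check of the reported numbers (assumes a linear ramp from
# 0 to ~8 MB/s over 3 days, i.e. an average rate of ~4 MB/s).
avg_rate_mb_s = 8 / 2          # mean of a linear ramp from 0 to 8 MB/s
seconds = 3 * 24 * 3600        # 3 days of continuous training
total_gb = avg_rate_mb_s * seconds / 1000
print(round(total_gb))         # ~1037 GB, same order as the ~1,400 GB billed
```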
Yesterday, October 5th, GCP charged us $67 for this network traffic in a single day:
I've also been seeing steady errors on the console related to the W&B connection:
An example run from this project that caused the above problem is linked below. I can't imagine what would have caused these four runs to transmit 1,400 GB of data off the instance in just 3 days. Please send this to the team for review. Thank you.
https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq?workspace=user-glenn-jocher