
W&B Network Usage Bug #5071

Closed
glenn-jocher opened this issue Oct 7, 2021 · 15 comments
Labels: bug (Something isn't working)
glenn-jocher (Member) commented Oct 7, 2021

@AyushExel I'm seeing some odd behavior from W&B here. I trained 4 (very small) models, only 4 MB each, over 3 days on a GCP instance, and saw enormous network egress costs related to 1,400 GB of data sent from the instance. I'm assuming nearly all of this is W&B traffic, as no other processes were running on the instance. The egress to W&B rose steadily over the course of training, reaching about 8 MB/s nonstop 24/7 by the end, and dropped back to zero after training completed.
[Screenshot: Screen Shot 2021-10-06 at 5 21 10 PM]
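
As a rough sanity check (a back-of-envelope sketch, assuming the rate ramped roughly linearly from zero to the ~8 MB/s peak over the 3 days, as the graph suggests), the integrated egress lands in the same ballpark as what GCP billed:

```python
# Back-of-envelope check, assuming a roughly linear ramp from 0 to ~8 MB/s over 3 days.
seconds = 3 * 24 * 3600            # 3 days of training
peak_mb_per_s = 8                  # egress rate observed near the end of training
avg_mb_per_s = peak_mb_per_s / 2   # linear ramp -> average is half the peak

total_gb = avg_mb_per_s * seconds / 1000
print(f"~{total_gb:.0f} GB")       # ~1037 GB, the same order as the ~1,400 GB observed
```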

Yesterday, October 5th, GCP charged us $67 for that single day of this network traffic:
[Screenshot: Screen Shot 2021-10-06 at 5 25 35 PM]

I've also been seeing steady errors on the console related to the W&B connection:
[Screenshot: Screen Shot 2021-10-06 at 5 28 18 PM]

An example run from this project that caused the above problem is here. I can't imagine what would have caused these four runs to transmit 1,400 GB of data off the instance in just 3 days. Please send this to the team for review. Thank you.
https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq?workspace=user-glenn-jocher

glenn-jocher added the bug label Oct 7, 2021
AyushExel (Contributor) commented:

@glenn-jocher Thanks, this definitely seems like a problem. Can you confirm that there were only 4 runs? The workspace has a total of 34 runs; presumably the rest are previous runs. Can you link the 4 runs that you think are responsible for this network traffic?

glenn-jocher (Member, Author) commented Oct 7, 2021

@AyushExel Yes, these were only the n1-n4 runs:

  • yolov5n1
  • yolov5n2
  • yolov5n3
  • yolov5n4

But it's possible the bug affects all the runs. These 4 are the only ones we trained on GCP with Docker (which we pay for ourselves); the others ran on AWS instances with Docker, where we're still using our credits and haven't checked the bills for egress breakdowns yet.

I think the terminal output may keep growing due to the tqdm issue with Docker plus the repeated 'Network error resolved' messages, and wandb perhaps keeps re-uploading the full terminal output. This seems likely since the traffic (MB/s) increases linearly from the start of training until the end, when it drops to zero:

[Screenshot: Screen Shot 2021-10-07 at 2 55 59 PM]

I checked the runs and they have only 1 artifact, which is best.pt on training completion, so it's not from uploading checkpoints.
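
If the console log really is the culprit, one workaround worth trying (untested here, and it assumes wandb's WANDB_CONSOLE setting controls stdout/stderr capture) would be to disable console syncing for these runs:

```python
# Possible mitigation sketch, not verified: turn off wandb's console capture so a
# tqdm-bloated terminal log can't keep being re-uploaded. The variable must be set
# before wandb.init() runs (e.g. export WANDB_CONSOLE=off inside the Docker container).
import os
os.environ["WANDB_CONSOLE"] = "off"

import wandb
run = wandb.init(project="yolov5_v6")   # project name taken from the runs above
# ... run train.py / the training loop as usual ...
run.finish()
```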

glenn-jocher (Member, Author) commented:

@AyushExel Also strange: my W&B storage usage shows 68 GB across all projects, and only 6 GB for this project (with the 34 runs), yet GCP says network egress from this instance was 1,700 GB just for these 4 models.

AyushExel (Contributor) commented:

@glenn-jocher Thanks for the detailed info. I have forwarded this to client engineers. I'll keep you updated.

AyushExel (Contributor) commented:

@glenn-jocher can you provide the zip of the run dir or the .wandb file for the run? It'll be helpful for finding the cause of this.
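
For reference, the local run directory is normally written under ./wandb/ next to the training script; a quick sketch like this (directory layout assumed, not verified) should list the .wandb files to zip up:

```python
# Sketch to locate local wandb run dirs and their .wandb files.
# Assumed layout: ./wandb/run-<timestamp>-<id>/ containing a run-<id>.wandb file.
from pathlib import Path

for run_dir in sorted(Path("wandb").glob("run-*")):
    for f in run_dir.glob("*.wandb"):
        print(run_dir, "->", f.name)
```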

glenn-jocher (Member, Author) commented Oct 8, 2021

@AyushExel Ah, unfortunately I can't; the instance has since been deleted. This should be reproducible, but we'd have to pay to reproduce it. The steps are just (see the egress-monitoring sketch after the list):

  • Launch an instance (GCP n1-standard-8 with a T4)
  • Pull and run the Docker image ultralytics/yolov5:latest
  • Log in to W&B
  • Train a model
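
While it reproduces, a rough monitor like this (a sketch that assumes psutil is available in the container) could log per-minute egress alongside training to confirm the climb toward ~8 MB/s:

```python
# Rough egress monitor sketch (assumption: psutil is installed in the container).
# Prints MB sent per 60 s interval so the steady climb in upload rate is visible.
import time
import psutil

prev = psutil.net_io_counters().bytes_sent
while True:
    time.sleep(60)
    cur = psutil.net_io_counters().bytes_sent
    print(f"egress: {(cur - prev) / 1e6:.1f} MB in the last 60 s")
    prev = cur
```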

AyushExel (Contributor) commented:

@glenn-jocher Ok, thanks. I'll reproduce this and get back to you.

glenn-jocher (Member, Author) commented:

@AyushExel thanks! I'm really curious what happens.

glenn-jocher (Member, Author) commented Nov 10, 2021

@AyushExel It's been a month since I opened this. This is a major bug that cost us $200 in network egress just for training 4 very small models (i.e. yolov3-tiny sized). I think you should raise this directly on the wandb repo and bring attention to it. The run is publicly available here:
https://wandb.ai/glenn-jocher/yolov3

[Screenshot: Screenshot 2021-11-10 at 22 27 00]

glenn-jocher (Member, Author) commented:

@AyushExel To clarify, the amount of data wandb egressed off our instance is unbelievable: >1,500 GB for 4 small models. Before any deeper integration can be considered, serious issues like this and others like #2685 (comment) clearly need to be resolved.

glenn-jocher (Member, Author) commented Nov 10, 2021

@AyushExel This is happening again right now. As I type this, AWS egress charges are eating into our cloud credits because of this ongoing W&B network egress bug, which has been open for over a month with no progress. W&B logging is becoming unusable due to this issue.

https://wandb.ai/glenn-jocher/yolov3
[Screenshot: Screenshot 2021-11-10 at 23 01 19]

glenn-jocher (Member, Author) commented:

@AyushExel I've raised this directly in wandb/wandb#2905 due to lack of progress here.

AyushExel (Contributor) commented:

@glenn-jocher This was being worked on. We take these issues very seriously, and this was reported to the team at once. Issues that aren't easily reproducible take more time to fix, which is why I requested the log files.

@vwrj has responded with a fix in the client issue that you opened. It'll go out with the next release on Tuesday. He'd like to get on a call with you to understand exactly how you run these scripts. Can you please coordinate to set up some time for that, @glenn-jocher?

AyushExel (Contributor) commented:

@glenn-jocher For bugs that aren't easily reproducible, let's coordinate over our joint Slack channel, where I can pull in engineers from other teams so you don't have to wait for the complete release cycle. Thanks!

glenn-jocher (Member, Author) commented:

@AyushExel The network usage bug appears to be resolved based on yesterday's tests. Closing the issue!

glenn-jocher removed the TODO (High priority items) label Nov 22, 2021