
W&B Network Usage Bug #5071

Closed
glenn-jocher opened this issue Oct 7, 2021 · 15 comments
Labels: bug (Something isn't working)
glenn-jocher (Member) commented Oct 7, 2021

@AyushExel I'm seeing some odd behavior from W&B here. I trained 4 (very small) models, only 4 MB each, over 3 days on a GCP instance, and saw enormous network egress costs related to 1,400 GB of data sent from the instance. I'm assuming nearly all of this is W&B traffic, as no other processes were running on the instance. The egress to W&B rose steadily over the course of training, reaching about 8 MB/s nonstop 24/7 by the end, and dropped back to zero after training completed.
[Screenshot: Screen Shot 2021-10-06 at 5 21 10 PM]
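
As a rough sanity check (a back-of-envelope sketch, assuming the rate ramped roughly linearly from zero to the ~8 MB/s peak over the 3 days, as the graph suggests), the integrated egress lands in the same ballpark as what GCP billed:

```python
# Back-of-envelope check, assuming a roughly linear ramp from 0 to ~8 MB/s over 3 days.
seconds = 3 * 24 * 3600            # 3 days of training
peak_mb_per_s = 8                  # egress rate observed near the end of training
avg_mb_per_s = peak_mb_per_s / 2   # linear ramp -> average is half the peak

total_gb = avg_mb_per_s * seconds / 1000
print(f"~{total_gb:.0f} GB")       # ~1037 GB, the same order as the ~1,400 GB observed
```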

Yesterday, October 5th, GCP charged us $67 for that single day of this network traffic:
[Screenshot: Screen Shot 2021-10-06 at 5 25 35 PM]

I've also been seeing steady errors on the console related to the W&B connection:
[Screenshot: Screen Shot 2021-10-06 at 5 28 18 PM]

An example run from this project that caused the above problem is here. I can't imagine what would have caused these four runs to transmit 1,400 GB of data off the instance in just 3 days. Please send this to the team for review. Thank you.
https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq?workspace=user-glenn-jocher

glenn-jocher added the bug label Oct 7, 2021
AyushExel (Contributor) commented:

@glenn-jocher Thanks, this definitely seems like a problem. Can you confirm that there were only 4 runs? The workspace has a total of 34 runs; presumably the rest are previous runs. Can you link the 4 runs that you think are responsible for this network traffic?

glenn-jocher (Member, Author) commented Oct 7, 2021

@AyushExel Yes, these were only the n1-n4 runs:

  • yolov5n1
  • yolov5n2
  • yolov5n3
  • yolov5n4

But it's possible the bug affects all the runs. These 4 are the only ones we trained on GCP with Docker (which we pay for ourselves); the others ran on AWS instances with Docker, where we're still using our credits and haven't checked the bills for egress breakdowns yet.

I think the terminal output may keep growing due to the tqdm issue with Docker plus the repeated 'Network error resolved' messages, and wandb perhaps keeps re-uploading the full terminal output. This seems likely since the traffic (MB/s) increases linearly from the start of training until the end, when it drops to zero:

[Screenshot: Screen Shot 2021-10-07 at 2 55 59 PM]

I checked the runs and they have only 1 artifact, which is best.pt on training completion, so it's not from uploading checkpoints.
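
If the console log really is the culprit, one workaround worth trying (untested here, and it assumes wandb's WANDB_CONSOLE setting controls stdout/stderr capture) would be to disable console syncing for these runs:

```python
# Possible mitigation sketch, not verified: turn off wandb's console capture so a
# tqdm-bloated terminal log can't keep being re-uploaded. The variable must be set
# before wandb.init() runs (e.g. export WANDB_CONSOLE=off inside the Docker container).
import os
os.environ["WANDB_CONSOLE"] = "off"

import wandb
run = wandb.init(project="yolov5_v6")   # project name taken from the runs above
# ... run train.py / the training loop as usual ...
run.finish()
```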

glenn-jocher (Member, Author) commented:

@AyushExel Also strange: my W&B storage usage shows 68 GB across all projects, and only 6 GB for this project (with the 34 runs), yet GCP says network egress from this instance was 1,700 GB just for these 4 models.

AyushExel (Contributor) commented:

@glenn-jocher Thanks for the detailed info. I have forwarded this to client engineers. I'll keep you updated.

AyushExel (Contributor) commented:

@glenn-jocher can you provide the zip of the run dir or the .wandb file for the run? It'll be helpful for finding the cause of this.
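
For reference, the local run directory is normally written under ./wandb/ next to the training script; a quick sketch like this (directory layout assumed, not verified) should list the .wandb files to zip up:

```python
# Sketch to locate local wandb run dirs and their .wandb files.
# Assumed layout: ./wandb/run-<timestamp>-<id>/ containing a run-<id>.wandb file.
from pathlib import Path

for run_dir in sorted(Path("wandb").glob("run-*")):
    for f in run_dir.glob("*.wandb"):
        print(run_dir, "->", f.name)
```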

glenn-jocher (Member, Author) commented Oct 8, 2021

@AyushExel Ah, unfortunately I can't; the instance has since been deleted. This should be reproducible, but we'd have to pay to reproduce it. The steps are just (see the egress-monitoring sketch after the list):

  • Launch an instance (GCP n1-standard-8 with a T4)
  • Pull and run the Docker image ultralytics/yolov5:latest
  • Log in to W&B
  • Train a model
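
While it reproduces, a rough monitor like this (a sketch that assumes psutil is available in the container) could log per-minute egress alongside training to confirm the climb toward ~8 MB/s:

```python
# Rough egress monitor sketch (assumption: psutil is installed in the container).
# Prints MB sent per 60 s interval so the steady climb in upload rate is visible.
import time
import psutil

prev = psutil.net_io_counters().bytes_sent
while True:
    time.sleep(60)
    cur = psutil.net_io_counters().bytes_sent
    print(f"egress: {(cur - prev) / 1e6:.1f} MB in the last 60 s")
    prev = cur
```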

AyushExel (Contributor) commented:

@glenn-jocher Ok, thanks. I'll reproduce this and get back to you.

glenn-jocher (Member, Author) commented:

@AyushExel thanks! I'm really curious what happens.

glenn-jocher (Member, Author) commented Nov 10, 2021

@AyushExel It's been a month since I opened this. This is a major bug that cost us $200 in network egress just for training 4 very small models (i.e. yolov3-tiny sized). I think you should raise this directly on the wandb repo and bring attention to it. The run is publicly available here:
https://wandb.ai/glenn-jocher/yolov3

[Screenshot: Screenshot 2021-11-10 at 22 27 00]

glenn-jocher (Member, Author) commented:

@AyushExel To clarify, the amount of data wandb egressed off our instance is unbelievable: >1,500 GB for 4 small models. Before any deeper integration can be considered, serious issues like this and others like #2685 (comment) clearly need to be resolved.

glenn-jocher (Member, Author) commented Nov 10, 2021

@AyushExel This is happening again right now. As I type this, AWS egress charges are eating into our cloud credits because of this ongoing W&B network egress bug, which has been open for over a month with no progress. W&B logging is becoming unusable due to this issue.

https://wandb.ai/glenn-jocher/yolov3
[Screenshot: Screenshot 2021-11-10 at 23 01 19]

glenn-jocher (Member, Author) commented:

@AyushExel I've raised this directly in wandb/wandb#2905 due to lack of progress here.

AyushExel (Contributor) commented:

@glenn-jocher This was being worked on. We take these issues very seriously, and this was reported to the team at once. Issues that aren't easily reproducible take more time to fix, which is why I requested the log files.

@vwrj has responded with a fix in the client issue that you opened. It'll go out with the next release on Tuesday. He'd like to get on a call with you to understand exactly how you run these scripts. Can you please coordinate to set up some time for that, @glenn-jocher?

AyushExel (Contributor) commented:

@glenn-jocher For bugs that aren't easily reproducible, let's coordinate over our joint Slack channel, where I can pull in engineers from other teams so you don't have to wait for the complete release cycle. Thanks!

glenn-jocher (Member, Author) commented:

@AyushExel The network usage bug appears to be resolved based on yesterday's tests. Closing the issue!

glenn-jocher removed the TODO (High priority items) label Nov 22, 2021