Active W&B-integration truncates output lines #2685
👋 Hello @Lechtr, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements
Python 3.8 or later with all requirements.txt dependencies installed, including: $ pip install -r requirements.txt

Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.
@Lechtr thanks for spotting this bug! I repeated your steps and I see the same truncated output. I'm not sure what the cause might be, but perhaps our W&B expert @AyushExel can help.
@AyushExel it seems W&B is truncating output somehow in Colab. These orange areas are missing when logging is enabled.
@glenn-jocher @Lechtr thanks for reporting. Looking into this now.
Also, my first instinct was that this is a problem related to tqdm with wandb, but I cannot reproduce it in any example other than the train.py script. A simpler reproducible example would be very helpful for the engineers.
But this works fine. A simpler example would be very helpful.
Could it be an issue related to a new version of wandb? I don't have time now, but will test it later.
@Lechtr I have tested with the latest version of wandb on my local machine. It works fine. The problem seems to occur only in Colab.
@AyushExel It is related to the latest wandb release, 0.10.24.
@AyushExel In the changelog for 0.10.24 it says
so that seems to be the direction we are looking for, especially this commit: https://github.com/wandb/client/commit/f2b191e6c5073abec9eb4d8efe5a26c8cdd35d1d. The following class in there sounds like a promising area to take a closer look at (line 72):
but I don't understand it at first glance; maybe you have more experience there. It might also be an idea to open an issue on the wandb project.
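For illustration, here is a hypothetical sketch (not wandb's actual code) of the suspected mechanism: a console redirector that only forwards complete, newline-terminated lines will hold back tqdm's carriage-return progress refreshes, which would make the bars appear truncated.

```python
# Hypothetical sketch of the suspected mechanism -- NOT wandb's actual code.
# A redirector that only forwards newline-terminated lines never shows tqdm's
# '\r'-prefixed progress refreshes; they sit in the buffer until a '\n' arrives.
import sys
import time

from tqdm import tqdm

class NewlineOnlyRedirect:
    """Forwards complete lines to the real stream; '\r' refreshes are buffered."""

    def __init__(self, stream):
        self.stream = stream
        self.buf = ""

    def write(self, text):
        self.buf += text
        while "\n" in self.buf:
            line, self.buf = self.buf.split("\n", 1)
            self.stream.write(line + "\n")
        return len(text)

    def flush(self):
        self.stream.flush()

wrapped = NewlineOnlyRedirect(sys.stderr)
for _ in tqdm(range(20), file=wrapped):  # each refresh starts with '\r', not '\n', so nothing shows until tqdm closes
    time.sleep(0.05)
```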
@Lechtr a reproducible example would be more helpful here, and then I can get CLI engineers involved. This seems like a specific tqdm+wandb related issue on Colab, as the tqdm progress bar and status are truncated. But I was not able to reproduce it with a simpler example. For example I tried this:
This works as intended, but the output from train.py is truncated. Are you able to reproduce this with a simpler example so that it is easier to pinpoint the source?
@AyushExel Your script gives the following output:
0.10.24:
0.10.23:
I assume it is not supposed to be different?
@Lechtr yes. That's the expected output. The progress bar is not getting truncated. But in train.py it gets truncated. We need a minimal example that reproduces this issue.
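For reference, the kind of minimal script being asked for here might look like the sketch below. This is illustrative only (the exact script tried above is not shown in this thread), and the project name is hypothetical.

```python
# Minimal tqdm + wandb loop to check whether progress-bar output gets truncated.
# Sketch only; "tqdm-truncation-repro" is a hypothetical project name.
import time

import wandb
from tqdm import tqdm

wandb.init(project="tqdm-truncation-repro")
pbar = tqdm(range(100))
for step in pbar:
    pbar.set_description(f"epoch {step // 10}")  # mimic per-epoch description updates like train.py
    wandb.log({"step": step})
    time.sleep(0.05)
wandb.finish()
```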
@Lechtr were you able to reproduce this bug on a simple example? If so, can you please share it here; it'll be helpful for debugging.
@AyushExel no sorry, I could not find a simple example to reproduce it.
@AyushExel I think since it's reproducible in Colab you should probably notify the CLI team and point them to their changes highlighted by #2685 (comment), as they may have introduced a bug in notebook environments.
@glenn-jocher @Lechtr I've filed a ticket to fix this. You can expect this to get resolved in the next release.
@AyushExel yay thanks!
@AyushExel @Lechtr I confirm I can reproduce this, and I should also note that the truncated output is not logged to W&B either, so it's lost/invisible in both Colab and W&B. Also note that print statements (white) work fine, while logger.log() statements (red) are truncated.
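One possible explanation for the print() vs. logger.log() difference, an assumption rather than anything confirmed in this thread: print() writes to stdout while Python's logging module writes to stderr by default, so a capture layer that treats the two streams differently can keep one and lose the other.

```python
# Sketch (assumption, not the confirmed cause): print() and logging use
# different streams by default, so a capturer that handles stdout and stderr
# differently can preserve one and truncate the other.
import logging

logging.basicConfig(level=logging.INFO)  # default StreamHandler writes to sys.stderr

print("this goes to sys.stdout")                             # rendered white in Colab
logging.getLogger(__name__).info("this goes to sys.stderr")  # rendered on a red background in Colab

# Running `python demo.py > only_stdout.txt` keeps the logging line on the
# console but moves the print line into the file -- the two streams diverge.
```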
Yes. This should be addressed in the next release. Engineers are on it.
@glenn-jocher This will be fixed by this PR in wandb CLI -> wandb/wandb#2111
@AyushExel should I remove the TODO and close this then? It looks like wandb/wandb#2111 is not merged yet though.
@glenn-jocher I'll just confirm this with someone.
@glenn-jocher The progress bar is missing on Colab even without wandb. I've tried it without running the wandb initialization cell and the progress bar is still missing. Here's what Fariz, the CLI engineer working on the PR, had to say about this:
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs. Access additional YOLOv5 🚀 resources:
Access additional Ultralytics ⚡ resources:
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
@AyushExel I've been running into an issue that I've been seeing for most of the year. W&B runs with Docker do not produce correct output in the 'Logs' tab, as the tqdm progress bars are not handled correctly by W&B there. I couldn't find an open issue on this specifically, but I know we've discussed this before. Is there any solution to this, or one in the works? Let me know please, as this affects all of my runs (I train 100% in Docker) and complicates efforts to share runs, e.g. all of the v6.0 runs for our recent release are affected.
@glenn-jocher I can get someone to look into this if it is reproducible. I tried running on Docker and the outputs for me are fine: https://wandb.ai/cayush/yoloV5/runs/2rs2bbqg/logs?workspace=user-cayush . I think this problem is one of cloud terminals not handling the tqdm outputs correctly. I've also seen this happen a lot, but I think it's because the terminal breaks and produces these broken lines; wandb just captures whatever is there in the terminal.
@AyushExel to reproduce you can use the same command I used to train the official models in Docker. I don't actually view any output; I use a redirect to /dev/null in the background with no terminal window:
python train.py --data coco.yaml --batch 64 --weights '' --cfg yolov5n.yaml --epochs 300 --img 640 --device 0 > /dev/null 2>&1 &
Training usually proceeds fine for 3-4 epochs before this error occurs, and then the rest of training displays the error. This might also be related to the W&B memory leak error, if all of this output is being transmitted to W&B.
@AyushExel I think I figured out what causes this. When training in Docker, exiting the container while leaving it running causes problems with the console outputs for some reason:
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/datasets:/usr/src/datasets $t
python train.py
# wait a few epochs
CTRL+P+Q  # exit container and leave it running
Not sure what to do about this; we can't leave consoles attached as they'll eventually disconnect at some point over a few days of training. EDIT: possible workaround:
@AyushExel updated screenshot proving that W&B is not simply copying console output; W&B is introducing errors into the Logs. This is a GCP instance console window vs. what shows up in the W&B Logs for the same training run. This is reproducible on every single training run I do with Docker. Runs are never logged correctly by W&B in Logs.
@AyushExel I'm observing this same Logs output bug with W&B in Google Colab. So now I can confirm this appears in all of the environments that I train models in: Docker and Colab. I don't have any W&B runs that log correctly. https://wandb.ai/glenn-jocher/test_VOC_ar_thr/runs/bxwvs7by/logs?workspace=user-glenn-jocher
@glenn-jocher yes. I'll update you on this once I hear back from the team.
@glenn-jocher A fix related to this has been merged in master along with the network egress issue fix.
@AyushExel great, glad to hear! I'll keep this open for now; let me know when the fixes are in the latest pip package please.
@AyushExel do you know the status of this? I've got an open TODO tag here; would be nice to remove it if the issue is resolved.
@glenn-jocher the network egress issue fix is verified.
@AyushExel thanks for looking into this! The Logs bug is really easy to reproduce; I think basically any training will show it eventually. You can use the official Colab notebook, for example, and train VOC models, and the bug will appear after about an hour. Here's some code you can use to reproduce it in Colab:

# 1. Setup
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install
import torch
from yolov5 import utils
display = utils.notebook_init()  # checks

# 2. Weights & Biases
%pip install -q wandb
import wandb
wandb.login()

# 3. Reproduce
!python train.py --batch 64 --weights yolov5n.pt --data VOC.yaml --epochs 50 --img 512 --nosave --hyp hyp.finetune.yaml --project yolov5_wandb_reproduce --name yolov5n

Then you can observe the Colab output vs the W&B Log:
@glenn-jocher Thanks. Yes, I was able to repro this and handed it over, with the log files, to the CLI engineers.
@glenn-jocher It's strange that it only starts to happen after a few minutes. It doesn't usually happen with coco128 for, say, 10 epochs.
@AyushExel yes it's strange! Actually I think it might correlate with the number of output lines (or tqdm updates) rather than with time, because when training VOC in Colab with --cache (on high-mem Colab instances) it shows up quickly, like 15 minutes in, around 10 epochs. If you don't cache it takes an hour to show up, which also seems to correspond to about 10 epochs. It seems like COCO128 with 300 epochs is just too small for it to show up (not enough tqdm updates).
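A sketch of how the line-count hypothesis could be stress-tested, an assumed setup rather than something actually run in this thread (the project name is hypothetical): generate a large number of tqdm refreshes under an active wandb run, then compare the console output with the W&B Logs tab.

```python
# Stress test for the "number of tqdm updates" hypothesis (assumed setup).
# Emits many progress-bar refreshes under an active wandb run; if the hypothesis
# holds, the Logs tab should start truncating once enough refreshes accumulate.
import wandb
from tqdm import tqdm

wandb.init(project="wandb-logs-stress-test")  # hypothetical project name
for epoch in range(20):
    # mininterval=0 makes tqdm refresh as often as possible
    for _ in tqdm(range(5000), desc=f"epoch {epoch}", mininterval=0):
        pass
wandb.finish()
```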
Before submitting a bug report, please be aware that your issue must be reproducible with all of the following, otherwise it is non-actionable, and we cannot help you: run git fetch && git status -uno to check and git pull to update the repo. If this is a custom dataset/training question you must include your train*.jpg, test*.jpg and results.png figures, or we cannot help you. You can generate these with utils.plot_results().

🐛 Bug
If W&B is activated, certain lines of output and progress bars are truncated.
Affected lines include, for example, the caching progress and the training epoch time/progress bar.
To Reproduce (REQUIRED)
Use the YOLOv5 tutorial Colab.
Execute:
Expected behavior
Expected output (with W&B off):
caching:
training:
Actual output (with W&B on):
caching:
training:
Environment
Google Colab YOLOv5 tutorial
Also on my personal Colab with custom data.
Additional context
If W&B is deactivated or activated with !wandb disabled or !wandb enabled, this behaviour is turned off or on accordingly.