
Active W&B integration truncates output lines #2685

Closed
Lechtr opened this issue Apr 1, 2021 · 39 comments · Fixed by wandb/wandb#2111
Labels: bug (Something isn't working), TODO (High priority items)

Comments

@Lechtr

Lechtr commented Apr 1, 2021


🐛 Bug

If W&B is activated, certain output lines and progress bars are truncated.
Affected lines include, for example, the dataset caching progress and the per-epoch training progress bar.

To Reproduce (REQUIRED)

Use the YOLOv5 tutorial Colab.
Execute the following cells (a rough sketch of the corresponding commands is shown after this list):

  • Setup
  • Download COCO128
  • Weights & Biases
  • Train YOLOv5s on COCO128 for 3 epochs
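
For reference, the corresponding Colab cells look roughly like this (a sketch approximated from the tutorial notebook; exact flags and dataset-download details may differ from the version in use):

# Setup
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

# Weights & Biases
%pip install -q wandb
import wandb
wandb.login()

# Train YOLOv5s on COCO128 for 3 epochs
# (the tutorial has a dedicated cell to download COCO128; newer versions of train.py can auto-download it)
!python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache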

Expected behavior

Expected output (with W&B off):
caching:

train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 1281314.83it/s]
train: Caching images (0.1GB): 100% 128/128 [00:00<00:00, 203.89it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 613566.76it/s]
val: Caching images (0.1GB): 100% 128/128 [00:00<00:00, 147.05it/s]

training:

Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       1/2     6.65G   0.04431   0.06402     0.019    0.1273       166       640: 100% 8/8 [00:01<00:00,  5.20it/s]
               Class      Images      Labels           P           R      [email protected]  [email protected]:.95: 100% 4/4 [00:01<00:00,  3.30it/s]
                 all         128         929       0.663       0.625       0.658       0.432

Actual output (with W&B on):
caching:

train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 fou
train: Caching images (0.1GB): 100%|█████████| 128/128 [00:00<00:00, 202.00it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found
val: Caching images (0.1GB): 100%|███████████| 128/128 [00:00<00:00, 144.14it/s]

training:

Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       0/2     3.29G   0.04237   0.06416   0.02121    0.1277       183       640
               Class      Images      Labels           P           R      [email protected]
                 all         128         929       0.641       0.637       0.661       0.431

Environment

Google Colab (YOLOv5 tutorial notebook).
Also reproduced in my personal Colab notebook with custom data.

Additional context

Disabling or enabling W&B with !wandb disabled or !wandb enabled turns this behaviour off or on accordingly (see the cells below).
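
The toggle described above as Colab cells (wandb disabled / wandb enabled are standard wandb CLI commands):

!wandb disabled  # turn the integration off -> output lines are no longer truncated
!wandb enabled   # turn it back on -> the truncation reappears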

@Lechtr added the bug label on Apr 1, 2021
@github-actions
Contributor

github-actions bot commented Apr 1, 2021

👋 Hello @Lechtr, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher added the TODO label on Apr 2, 2021
@glenn-jocher
Member

@Lechtr thanks for spotting this bug! I repeated your steps and I see the same truncated output. I'm not sure what the cause might be, but perhaps our W&B expert @AyushExel can help.

@glenn-jocher
Member

@AyushExel it seems W&B is truncating output somehow in Colab. These orange areas are missing when logging is enabled.
Screenshot 2021-04-02 at 13 22 56

@AyushExel
Contributor

@glenn-jocher @Lechtr thanks for reporting. Looking into this now.
I was able to reproduce this only on Colab. Can you also reproduce it on a local machine? For me, it works as expected on a Linux desktop.
I'm trying to get to the root cause of the problem. I'll update soon.

@AyushExel
Contributor

Also, my first instinct was that this is a problem related to tqdm with wandb, but I cannot reproduce it in any example other than the train.py script. A simpler reproducible example would be very helpful for the engineers.
I tried this:

import wandb
from tqdm import tqdm

run = wandb.init()
l = [10,100,1000]
pbar = enumerate(l)
pbar = tqdm(pbar, total=3)
for i, x in pbar:
  print(x)

But this works fine. A simpler example would be very helpful.

@Lechtr
Author

Lechtr commented Apr 2, 2021

Could it be an issue related to a new version of wandb?
That might explain why it breaks in Colab, where the newest version is installed, but works on your local machine, where you might have an older version installed.

I don't have time now, but will test it later.

@AyushExel
Contributor

@Lechtr I have tested with the latest version of wandb on my local machine and it works fine. The problem seems to occur only on Colab.

@Lechtr
Author

Lechtr commented Apr 2, 2021

@AyushExel It is related to the latest wandb release, 0.10.24.
If I try the previous version, 0.10.23, with %pip install -Iv wandb==0.10.23, it works fine.
I will check the changes made in the new release; maybe I will find the problem.

@Lechtr
Author

Lechtr commented Apr 2, 2021

@AyushExel In the changelog for 0.10.24 it says

Significant fixes to stdout/stderr console logging

so that seems to be the direction we are looking for.

In particular, this commit https://github.com/wandb/client/commit/f2b191e6c5073abec9eb4d8efe5a26c8cdd35d1d
made some changes in the file wandb/sdk/internal/file_stream.py that seem interesting:

The following class in there sounds like a promising area to take a closer look at (line 72):

class CRDedupeFilePolicy(DefaultFilePolicy):
    """File stream policy that removes characters that would be erased by
    carriage returns.
    This is what a terminal does. We use it for console output to reduce the
    amount of data we need to send over the network (eg. for progress bars),
    while preserving the output's appearance in the web app.
    """

but I don't understand it at first glance; maybe you have more experience there.

It might also be an idea to open an issue in the wandb project.
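
For illustration, the carriage-return handling that the docstring describes could look conceptually like the sketch below. This is not wandb's actual implementation, just a minimal approximation of "keep only what a terminal would show after a carriage return":

# Minimal sketch (not wandb's code): a terminal shows only the text written
# after the last '\r' on a line, so repeated progress-bar updates overwrite
# each other instead of all being kept and transmitted.
def dedupe_carriage_returns(chunk: str) -> str:
    lines = []
    for line in chunk.split("\n"):
        lines.append(line.rsplit("\r", 1)[-1])  # keep only the final segment
    return "\n".join(lines)

print(dedupe_carriage_returns("  0%|    | 0/3\r 50%|██  | 1/3\r100%|████| 3/3"))
# -> 100%|████| 3/3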

@AyushExel
Contributor

@Lechtr a reproducible example would be more helpful here, and then I can get the CLI engineers involved. This seems like a tqdm+wandb related issue specific to Colab, as the tqdm progress bar and status line are truncated. But I was not able to reproduce it with a simpler example. For example, I tried this:

import wandb
from tqdm import tqdm

run = wandb.init()
l = [10,100,1000]
pbar = enumerate(l)
pbar = tqdm(pbar, total=3)
for i, x in pbar:
  print(x)

This works as intended, but the output from train.py is truncated. Are you able to reproduce this with a simpler example so that it is easier to pinpoint the source?

@Lechtr
Author

Lechtr commented Apr 2, 2021

@AyushExel Your script gives the following output:

0.10.24:

  0%|                                                     | 0/3 [00:00<?, ?it/s]10
100
1000
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3293.09it/s]

0.10.23:

10
100
1000
100%|██████████| 3/3 [00:00<00:00, 5291.38it/s]

I assume it is not supposed to be different?

@AyushExel
Contributor

@Lechtr Yes, that's the expected output; the progress bar is not getting truncated. But in train.py it does get truncated. We need a minimal example that reproduces this issue.

@AyushExel
Contributor

@Lechtr were you able to reproduce this bug with a simple example? If so, please share it here; it'll be helpful for debugging.
If not, I can get someone to look into it anyway, but that will happen on Monday.

@Lechtr
Author

Lechtr commented Apr 4, 2021

@AyushExel No, sorry, I could not find a simple example that reproduces it.

@glenn-jocher
Member

@AyushExel I think since it's reproducible in Colab you should probably notify the CLI team and point them to their changes highlighted by #2685 (comment), as they may have introduced a bug in notebook environments.

@AyushExel
Contributor

@glenn-jocher @Lechtr I've filed a ticket to fix this. You can expect this to get resolved in the next release

@glenn-jocher
Member

@AyushExel yay thanks!

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2021

@AyushExel @Lechtr I confirm I can reproduce this, and I should also note that the truncated output is not logged to W&B either, so it's lost/invisible in both Colab and W&B. Also note that print statements (white) work fine, while logger.log() statements (red) are truncated.
https://wandb.ai/glenn-jocher/YOLOv5/runs/23u3uq0x/logs?workspace=user-glenn-jocher
[screenshot: 2021-04-08 23:58:38]
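
As a small aside on the white vs. red distinction above (my own illustration, not YOLOv5 code): print() writes to stdout, while Python's default logging handler writes to stderr, so a console-capture layer may treat the two streams differently.

import logging

logging.basicConfig(level=logging.INFO)  # the default StreamHandler writes to sys.stderr
print("this goes to stdout")                             # rendered white in the W&B Logs tab
logging.getLogger(__name__).info("this goes to stderr")  # rendered red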

@AyushExel
Contributor

Yes. This should be addressed in the next release. Engineers are on it.

@AyushExel
Contributor

@glenn-jocher This will be fixed by this PR in the wandb CLI: wandb/wandb#2111

@glenn-jocher
Member

@AyushExel should I remove the TODO and close this then? It looks like wandb/wandb#2111 is not merged yet though.

@AyushExel
Contributor

@glenn-jocher I'll just confirm this with someone.

@AyushExel
Contributor

@glenn-jocher The progress bar is missing on Colab even without wandb. I tried it without running the wandb initialization cell, and the progress bar is still missing.
[screenshot]

Here's what Fariz, the CLI engineer working on the PR, had to say about this:

Running it without wandb (or with `WANDB_CONSOLE="off"`)? I'm sure the progress bar won't be printed in that case either. We should be matching the behavior without wandb when wandb is initialized as well.
Now a bit more post-mortem on why the progress bar is visible when older versions of wandb are initialized, but not with the current branch or when wandb is not initialized: tqdm prints the progress bar based on the window width returned by an ioctl system call. This would be zero for subprocesses spawned in Colab and would lead to a zero-width progress bar in such cases.
In older versions of wandb we fall back to 80 if the width is 0 (https://github.com/wandb/client/blob/master/wandb/sdk/lib/redirect.py#L573), which causes the progress bar to render.
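
To make the width issue above concrete, here is a minimal sketch (my own illustration, not wandb's code) showing how to check the terminal width a process reports and how a fixed bar width can be forced on the tqdm side:

import shutil
from tqdm import tqdm

# tqdm sizes its bar from the width the process reports; in a Colab subprocess
# this can come back as 0, which yields a zero-width (invisible) progress bar.
print("reported terminal size:", shutil.get_terminal_size())

# Workaround sketch: force a fixed bar width so rendering does not depend on
# whatever width the (pseudo-)terminal reports.
for _ in tqdm(range(100), ncols=80):
    pass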

@github-actions
Contributor

github-actions bot commented Jun 23, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Jun 23, 2021
@glenn-jocher
Member

@AyushExel I've been running into an issue that I've seen for most of the year: W&B runs launched from Docker do not produce correct output in the 'Logs' tab, as the tqdm progress bars are not handled correctly by W&B. I couldn't find an open issue on this specifically, but I know we've discussed it before. Is there any solution to this, or one in the works? Please let me know, as this affects all of my runs (I train 100% in Docker) and complicates efforts to share runs, e.g. all of the v6.0 runs for our recent release are affected.
https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq/logs?workspace

[screenshot: 2021-10-25 4:53:43 PM]

@AyushExel
Contributor

AyushExel commented Oct 25, 2021

@glenn-jocher I can get someone to look into this if it is reproducible. I tried running in Docker and the outputs for me are fine: https://wandb.ai/cayush/yoloV5/runs/2rs2bbqg/logs?workspace=user-cayush

I think this problem is with cloud terminals not handling the tqdm output correctly. I've also seen this happen a lot, but I think it's because the terminal breaks and produces these broken lines; wandb just captures whatever is in the terminal.
If this happens due to wandb, then it's definitely a bug. But I don't think that's the case. I might be wrong, though.

@glenn-jocher
Member

glenn-jocher commented Oct 26, 2021

@AyushExel to reproduce, you can use the same command I used to train the official models in Docker. I don't actually view any output; I redirect it to /dev/null and run in the background with no terminal window:
python train.py --data coco.yaml --batch 64 --weights '' --cfg yolov5n.yaml --epochs 300 --img 640 --device 0  > /dev/null 2>&1 &

Training usually proceeds fine for 3-4 epochs before this error occurs, and then the rest of training displays the error. This might also be related to the W&B memory-leak issue, if all of this output is being transmitted to W&B.

@glenn-jocher
Member

glenn-jocher commented Oct 27, 2021

@AyushExel I think I figured out what causes this. When training in Docker, detaching from the container while leaving it running causes problems with console output for some reason:

t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/datasets:/usr/src/datasets $t
python train.py
# wait a few epochs
CTRL+P+Q  # exit container and leave it running

Not sure what to do about this; we can't leave consoles attached, as they'll eventually disconnect at some point over a few days of training.

EDIT: possible workaround:
https://stackoverflow.com/questions/38786615/docker-number-of-lines-in-terminal-changing-inside-docker
Once inside the container: bash -c "stty cols $COLUMNS rows $LINES && bash". I tested this and it seems to work. Maybe I could embed this in YOLOv5 with something like if is_docker(): os.system(cmd).
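
A rough sketch of what that embedding could look like (my own illustration; the is_docker() check and the fixed 80x24 geometry are assumptions, and per EDIT2 below this approach did not end up fixing the problem):

import os

def is_docker() -> bool:
    # Hypothetical helper: detect a container via the marker file Docker creates.
    return os.path.exists("/.dockerenv")

if is_docker():
    # Force a sane terminal geometry so tqdm does not see a zero-width console.
    # Note: stty operates on an attached terminal, which may be why this did not help here.
    os.system("stty cols 80 rows 24")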

EDIT2: never mind, the fix above does not work.
[screenshot: 2021-10-27 4:27:41 PM]

@glenn-jocher removed the Stale label on Oct 27, 2021
@glenn-jocher self-assigned this on Oct 27, 2021
@glenn-jocher
Member

@AyushExel here is an updated screenshot showing that W&B is not simply copying console output; W&B is introducing errors into the Logs. This is a GCP instance console window vs. what shows up in the W&B Logs for the same training run.
https://wandb.ai/glenn-jocher/YOLOv5/runs/11f90i7k/logs?workspace=user-glenn-jocher

This is reproducible on every single training run I do with Docker. Runs are never logged correctly by W&B in Logs.

[screenshot: 2021-11-10 16:03:44]

@glenn-jocher
Member

glenn-jocher commented Nov 12, 2021

@AyushExel I'm observing this same Logs output bug with W&B in Google Colab. So now I can confirm it appears in all of the environments that I train models in: Docker and Colab. I don't have any W&B runs that log correctly.

https://wandb.ai/glenn-jocher/test_VOC_ar_thr/runs/bxwvs7by/logs?workspace=user-glenn-jocher
[screenshot: 2021-11-12 13:27:10]

@AyushExel
Contributor

@glenn-jocher yes. I'll update you on this once I hear back from the team.

@AyushExel
Contributor

@glenn-jocher A fix related to this has been merged into master along with the network-egress fix.
We'll investigate this issue further if it still exists after the next release.

@glenn-jocher
Member

@AyushExel great, glad to hear it! I'll keep this open for now; please let me know when the fixes are in the latest pip package.

@glenn-jocher
Member

> @glenn-jocher A fix related to this has been merged into master along with the network-egress fix. We'll investigate this issue further if it still exists after the next release.

@AyushExel do you know the status of this? I've got an open TODO tag here; it would be nice to remove it if the issue is resolved.

@AyushExel
Contributor

@glenn-jocher the network-egress fix is verified.
The ticket for the tqdm progress bars is still open, as it's difficult to verify without a reproducible script. If someone encounters it again, I'll increase the urgency of the ticket by upvoting it.

@glenn-jocher
Member

glenn-jocher commented Jan 7, 2022

@AyushExel thanks for looking into this! The Logs bug is really easy to reproduce; I think basically any training will show it eventually. For example, you can use the official Colab notebook and train VOC models, and the bug will appear after about an hour. Here's some code you can use to reproduce it in Colab:

# 1. Setup
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install

import torch
import utils
display = utils.notebook_init()  # checks

# 2. Weights & Biases
%pip install -q wandb
import wandb
wandb.login()

# 3. Reproduce
!python train.py --batch 64 --weights yolov5n.pt --data VOC.yaml --epochs 50 --img 512 --nosave --hyp hyp.finetune.yaml --project yolov5_wandb_reproduce --name yolov5n

Then you can observe the Colab output vs the W&B Log:
https://wandb.ai/glenn-jocher/yolov5_wandb_reproduce/runs/3lvzhcu6/logs?workspace=user-glenn-jocher
[screenshot: 2022-01-06 6:54:57 PM]

@AyushExel
Contributor

@glenn-jocher Thanks. Yes, I was able to repro it and have handed it over, along with the log files, to the CLI engineers.

@AyushExel
Contributor

@glenn-jocher It's strange that it only starts to happen after a few minutes. It doesn't usually happen with COCO128 for, say, 10 epochs.

@glenn-jocher
Member

glenn-jocher commented Jan 10, 2022

@AyushExel yes, it's strange! Actually, I think it might correlate with the number of output lines (or tqdm updates) rather than with time, because when training VOC in Colab with --cache (on high-memory Colab instances) it shows up quickly, in about 15 minutes, around 10 epochs in. If you don't cache, it takes an hour to show up, which also corresponds to about 10 epochs.

Seems like COCO128 with 300 epochs is just too small for it to show up (not enough tqdm updates).
