
Active W&B integration truncates output lines #2685

Closed
Lechtr opened this issue Apr 1, 2021 · 39 comments · Fixed by wandb/wandb#2111
Labels: bug (Something isn't working), TODO (High priority items)

Comments

@Lechtr

Lechtr commented Apr 1, 2021


🐛 Bug

If W&B is activated, certain output lines and progress bars are truncated.
Affected lines include, for example, the dataset caching progress and the per-epoch training progress bar.

To Reproduce (REQUIRED)

Use the YOLOv5 tutorial Colab.
Execute the following cells (a rough sketch of the corresponding commands is shown after this list):

  • Setup
  • Download COCO128
  • Weights & Biases
  • Train YOLOv5s on COCO128 for 3 epochs
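
For reference, the corresponding Colab cells look roughly like this (a sketch approximated from the tutorial notebook; exact flags and dataset-download details may differ from the version in use):

# Setup
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install dependencies

# Weights & Biases
%pip install -q wandb
import wandb
wandb.login()

# Train YOLOv5s on COCO128 for 3 epochs
# (the tutorial has a dedicated cell to download COCO128; newer versions of train.py can auto-download it)
!python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --cache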

Expected behavior

Expected output (with W&B off):
caching:

train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 1281314.83it/s]
train: Caching images (0.1GB): 100% 128/128 [00:00<00:00, 203.89it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found, 0 missing, 2 empty, 0 corrupted: 100% 128/128 [00:00<00:00, 613566.76it/s]
val: Caching images (0.1GB): 100% 128/128 [00:00<00:00, 147.05it/s]

training:

Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       1/2     6.65G   0.04431   0.06402     0.019    0.1273       166       640: 100% 8/8 [00:01<00:00,  5.20it/s]
               Class      Images      Labels           P           R      [email protected]  [email protected]:.95: 100% 4/4 [00:01<00:00,  3.30it/s]
                 all         128         929       0.663       0.625       0.658       0.432

Actual output (with W&B on):
caching:

train: Scanning '../coco128/labels/train2017.cache' images and labels... 128 fou
train: Caching images (0.1GB): 100%|█████████| 128/128 [00:00<00:00, 202.00it/s]
val: Scanning '../coco128/labels/train2017.cache' images and labels... 128 found
val: Caching images (0.1GB): 100%|███████████| 128/128 [00:00<00:00, 144.14it/s]

training:

Epoch   gpu_mem       box       obj       cls     total    labels  img_size
       0/2     3.29G   0.04237   0.06416   0.02121    0.1277       183       640
               Class      Images      Labels           P           R      [email protected]
                 all         128         929       0.641       0.637       0.661       0.431

Environment

Google Colab (YOLOv5 tutorial notebook).
Also reproduced in my personal Colab notebook with custom data.

Additional context

Disabling or enabling W&B with !wandb disabled or !wandb enabled turns this behaviour off or on accordingly (see the cells below).
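
The toggle described above as Colab cells (wandb disabled / wandb enabled are standard wandb CLI commands):

!wandb disabled  # turn the integration off -> output lines are no longer truncated
!wandb enabled   # turn it back on -> the truncation reappears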

@Lechtr added the bug label on Apr 1, 2021
@github-actions
Contributor

github-actions bot commented Apr 1, 2021

👋 Hello @Lechtr, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at [email protected].

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher added the TODO label on Apr 2, 2021
@glenn-jocher
Member

@Lechtr thanks for spotting this bug! I repeated your steps and I see the same truncated output. I'm not sure what the cause might be, but perhaps our W&B expert @AyushExel can help.

@glenn-jocher
Member

@AyushExel it seems W&B is truncating output somehow in Colab. These orange areas are missing when logging is enabled.
Screenshot 2021-04-02 at 13 22 56

@AyushExel
Contributor

@glenn-jocher @Lechtr thanks for reporting. Looking into this now.
I was able to reproduce this only on Colab. Can you also reproduce it on a local machine? For me, it works as expected on a Linux desktop.
I'm trying to get to the root cause of the problem. I'll update soon.

@AyushExel
Contributor

Also, my first instinct was that this is a problem related to tqdm with wandb, but I cannot reproduce it in any example other than the train.py script. A simpler reproducible example would be very helpful for the engineers.
I tried this:

import wandb
from tqdm import tqdm

run = wandb.init()
l = [10,100,1000]
pbar = enumerate(l)
pbar = tqdm(pbar, total=3)
for i, x in pbar:
  print(x)

But this works fine. A simpler example would be very helpful.

@Lechtr
Author

Lechtr commented Apr 2, 2021

Could it be an issue related to a new version of wandb?
That might explain why it breaks in Colab, where the newest version is installed, but works on your local machine, where you might have an older version installed.

I don't have time now, but will test it later.

@AyushExel
Contributor

@Lechtr I have tested with the latest version of wandb on my local machine and it works fine. The problem seems to occur only on Colab.

@Lechtr
Author

Lechtr commented Apr 2, 2021

@AyushExel It is related to the latest wandb release, 0.10.24.
If I try the previous version, 0.10.23, with %pip install -Iv wandb==0.10.23, it works fine.
I will check the changes made in the new release; maybe I will find the problem.

@Lechtr
Author

Lechtr commented Apr 2, 2021

@AyushExel In the changelog for 0.10.24 it says

Significant fixes to stdout/stderr console logging

so that seems to be the direction we are looking for.

In particular, this commit https://github.com/wandb/client/commit/f2b191e6c5073abec9eb4d8efe5a26c8cdd35d1d
made some changes in the file wandb/sdk/internal/file_stream.py that seem interesting:

The following class in there sounds like a promising area to take a closer look at (line 72):

class CRDedupeFilePolicy(DefaultFilePolicy):
    """File stream policy that removes characters that would be erased by
    carriage returns.
    This is what a terminal does. We use it for console output to reduce the
    amount of data we need to send over the network (eg. for progress bars),
    while preserving the output's appearance in the web app.
    """

but I don't understand it at first glance; maybe you have more experience there.

It might also be an idea to open an issue in the wandb project.
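
For illustration, the carriage-return handling that the docstring describes could look conceptually like the sketch below. This is not wandb's actual implementation, just a minimal approximation of "keep only what a terminal would show after a carriage return":

# Minimal sketch (not wandb's code): a terminal shows only the text written
# after the last '\r' on a line, so repeated progress-bar updates overwrite
# each other instead of all being kept and transmitted.
def dedupe_carriage_returns(chunk: str) -> str:
    lines = []
    for line in chunk.split("\n"):
        lines.append(line.rsplit("\r", 1)[-1])  # keep only the final segment
    return "\n".join(lines)

print(dedupe_carriage_returns("  0%|    | 0/3\r 50%|██  | 1/3\r100%|████| 3/3"))
# -> 100%|████| 3/3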

@AyushExel
Contributor

@Lechtr a reproducible example would be more helpful here, and then I can get the CLI engineers involved. This seems like a tqdm+wandb related issue specific to Colab, as the tqdm progress bar and status line are truncated. But I was not able to reproduce it with a simpler example. For example, I tried this:

import wandb
from tqdm import tqdm

run = wandb.init()
l = [10,100,1000]
pbar = enumerate(l)
pbar = tqdm(pbar, total=3)
for i, x in pbar:
  print(x)

This works as intended, but the output from train.py is truncated. Are you able to reproduce this with a simpler example so that it is easier to pinpoint the source?

@Lechtr
Author

Lechtr commented Apr 2, 2021

@AyushExel Your script gives the following output:

0.10.24:

  0%|                                                     | 0/3 [00:00<?, ?it/s]10
100
1000
100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 3293.09it/s]

0.10.23:

10
100
1000
100%|██████████| 3/3 [00:00<00:00, 5291.38it/s]

I assume it is not supposed to be different?

@AyushExel
Contributor

@Lechtr Yes, that's the expected output; the progress bar is not getting truncated. But in train.py it does get truncated. We need a minimal example that reproduces this issue.

@AyushExel
Contributor

@Lechtr were you able to reproduce this bug with a simple example? If so, please share it here; it'll be helpful for debugging.
If not, I can get someone to look into it anyway, but that will happen on Monday.

@Lechtr
Author

Lechtr commented Apr 4, 2021

@AyushExel No, sorry, I could not find a simple example that reproduces it.

@glenn-jocher
Member

@AyushExel I think since it's reproducible in Colab you should probably notify the CLI team and point them to their changes highlighted by #2685 (comment), as they may have introduced a bug in notebook environments.

@AyushExel
Contributor

@glenn-jocher @Lechtr I've filed a ticket to fix this. You can expect this to get resolved in the next release

@glenn-jocher
Member

@AyushExel yay thanks!

@glenn-jocher
Member

glenn-jocher commented Apr 8, 2021

@AyushExel @Lechtr I confirm I can reproduce this, and I should also note that the truncated output is not logged to W&B either, so it's lost/invisible in both Colab and W&B. Also note that print statements (white) work fine, while logger.log() statements (red) are truncated.
https://wandb.ai/glenn-jocher/YOLOv5/runs/23u3uq0x/logs?workspace=user-glenn-jocher
[screenshot: 2021-04-08 23:58:38]
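
As a small aside on the white vs. red distinction above (my own illustration, not YOLOv5 code): print() writes to stdout, while Python's default logging handler writes to stderr, so a console-capture layer may treat the two streams differently.

import logging

logging.basicConfig(level=logging.INFO)  # the default StreamHandler writes to sys.stderr
print("this goes to stdout")                             # rendered white in the W&B Logs tab
logging.getLogger(__name__).info("this goes to stderr")  # rendered red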

@AyushExel
Contributor

Yes. This should be addressed in the next release. Engineers are on it.

@AyushExel
Contributor

@glenn-jocher This will be fixed by this PR in the wandb CLI: wandb/wandb#2111

@glenn-jocher
Member

@AyushExel should I remove the TODO and close this then? It looks like wandb/wandb#2111 is not merged yet though.

@AyushExel
Contributor

@glenn-jocher I'll just confirm this with someone.

@AyushExel
Contributor

@glenn-jocher The progress bar is missing on Colab even without wandb. I tried it without running the wandb initialization cell, and the progress bar is still missing.
[screenshot]

Here's what Fariz, the CLI engineer working on the PR, had to say about this:

Running it without wandb (or with `WANDB_CONSOLE="off"`)? I'm sure the progress bar won't be printed in that case either. We should be matching the behavior without wandb when wandb is initialized as well.
Now a bit more post-mortem on why the progress bar is visible when older versions of wandb are initialized, but not with the current branch or when wandb is not initialized: tqdm prints the progress bar based on the window width returned by an ioctl system call. This would be zero for subprocesses spawned in Colab and would lead to a zero-width progress bar in such cases.
In older versions of wandb we fall back to 80 if the width is 0 (https://github.com/wandb/client/blob/master/wandb/sdk/lib/redirect.py#L573), which causes the progress bar to render.
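
To make the width issue above concrete, here is a minimal sketch (my own illustration, not wandb's code) showing how to check the terminal width a process reports and how a fixed bar width can be forced on the tqdm side:

import shutil
from tqdm import tqdm

# tqdm sizes its bar from the width the process reports; in a Colab subprocess
# this can come back as 0, which yields a zero-width (invisible) progress bar.
print("reported terminal size:", shutil.get_terminal_size())

# Workaround sketch: force a fixed bar width so rendering does not depend on
# whatever width the (pseudo-)terminal reports.
for _ in tqdm(range(100), ncols=80):
    pass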

@github-actions
Contributor

github-actions bot commented Jun 23, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@github-actions github-actions bot added the Stale Stale and schedule for closing soon label Jun 23, 2021
@glenn-jocher
Member

@AyushExel I've been running into an issue that I've seen for most of the year: W&B runs launched from Docker do not produce correct output in the 'Logs' tab, as the tqdm progress bars are not handled correctly by W&B. I couldn't find an open issue on this specifically, but I know we've discussed it before. Is there any solution to this, or one in the works? Please let me know, as this affects all of my runs (I train 100% in Docker) and complicates efforts to share runs, e.g. all of the v6.0 runs for our recent release are affected.
https://wandb.ai/glenn-jocher/yolov5_v6/runs/1yyym1gq/logs?workspace

[screenshot: 2021-10-25 4:53:43 PM]

@AyushExel
Contributor

AyushExel commented Oct 25, 2021

@glenn-jocher I can get someone to look into this if it is reproducible. I tried running in Docker and the outputs for me are fine: https://wandb.ai/cayush/yoloV5/runs/2rs2bbqg/logs?workspace=user-cayush

I think this problem is with cloud terminals not handling the tqdm output correctly. I've also seen this happen a lot, but I think it's because the terminal breaks and produces these broken lines; wandb just captures whatever is in the terminal.
If this happens due to wandb, then it's definitely a bug. But I don't think that's the case. I might be wrong, though.

@glenn-jocher
Member

glenn-jocher commented Oct 26, 2021

@AyushExel to reproduce, you can use the same command I used to train the official models in Docker. I don't actually view any output; I redirect it to /dev/null and run in the background with no terminal window:
python train.py --data coco.yaml --batch 64 --weights '' --cfg yolov5n.yaml --epochs 300 --img 640 --device 0  > /dev/null 2>&1 &

Training usually proceeds fine for 3-4 epochs before this error occurs, and then the rest of training displays the error. This might also be related to the W&B memory-leak issue, if all of this output is being transmitted to W&B.

@glenn-jocher
Member

glenn-jocher commented Oct 27, 2021

@AyushExel I think I figured out what causes this. When training in Docker, detaching from the container while leaving it running causes problems with console output for some reason:

t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/datasets:/usr/src/datasets $t
python train.py
# wait a few epochs
CTRL+P+Q  # exit container and leave it running

Not sure what to do about this; we can't leave consoles attached, as they'll eventually disconnect at some point over a few days of training.

EDIT: possible workaround:
https://stackoverflow.com/questions/38786615/docker-number-of-lines-in-terminal-changing-inside-docker
Once inside the container: bash -c "stty cols $COLUMNS rows $LINES && bash". I tested this and it seems to work. Maybe I could embed this in YOLOv5 with something like if is_docker(): os.system(cmd).
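
A rough sketch of what that embedding could look like (my own illustration; the is_docker() check and the fixed 80x24 geometry are assumptions, and per EDIT2 below this approach did not end up fixing the problem):

import os

def is_docker() -> bool:
    # Hypothetical helper: detect a container via the marker file Docker creates.
    return os.path.exists("/.dockerenv")

if is_docker():
    # Force a sane terminal geometry so tqdm does not see a zero-width console.
    # Note: stty operates on an attached terminal, which may be why this did not help here.
    os.system("stty cols 80 rows 24")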

EDIT2: never mind, the fix above does not work.
[screenshot: 2021-10-27 4:27:41 PM]

@glenn-jocher removed the Stale label on Oct 27, 2021
@glenn-jocher self-assigned this on Oct 27, 2021
@glenn-jocher
Member

@AyushExel here is an updated screenshot showing that W&B is not simply copying console output; W&B is introducing errors into the Logs. This is a GCP instance console window vs. what shows up in the W&B Logs for the same training run.
https://wandb.ai/glenn-jocher/YOLOv5/runs/11f90i7k/logs?workspace=user-glenn-jocher

This is reproducible on every single training run I do with Docker. Runs are never logged correctly by W&B in Logs.

[screenshot: 2021-11-10 16:03:44]

@glenn-jocher
Member

glenn-jocher commented Nov 12, 2021

@AyushExel I'm observing this same Logs output bug with W&B in Google Colab. So now I can confirm it appears in all of the environments that I train models in: Docker and Colab. I don't have any W&B runs that log correctly.

https://wandb.ai/glenn-jocher/test_VOC_ar_thr/runs/bxwvs7by/logs?workspace=user-glenn-jocher
[screenshot: 2021-11-12 13:27:10]

@AyushExel
Contributor

@glenn-jocher yes. I'll update you on this once I hear back from the team.

@AyushExel
Contributor

@glenn-jocher A fix related to this has been merged into master along with the network-egress fix.
We'll investigate this issue further if it still exists after the next release.

@glenn-jocher
Member

@AyushExel great, glad to hear it! I'll keep this open for now; please let me know when the fixes are in the latest pip package.

@glenn-jocher
Member

> @glenn-jocher A fix related to this has been merged into master along with the network-egress fix. We'll investigate this issue further if it still exists after the next release.

@AyushExel do you know the status of this? I've got an open TODO tag here; it would be nice to remove it if the issue is resolved.

@AyushExel
Contributor

@glenn-jocher the network-egress fix is verified.
The ticket for the tqdm progress bars is still open, as it's difficult to verify without a reproducible script. If someone encounters it again, I'll increase the urgency of the ticket by upvoting it.

@glenn-jocher
Member

glenn-jocher commented Jan 7, 2022

@AyushExel thanks for looking into this! The Logs bug is really easy to reproduce; I think basically any training will show it eventually. For example, you can use the official Colab notebook and train VOC models, and the bug will appear after about an hour. Here's some code you can use to reproduce it in Colab:

# 1. Setup
!git clone https://github.com/ultralytics/yolov5  # clone
%cd yolov5
%pip install -qr requirements.txt  # install

import torch
import utils
display = utils.notebook_init()  # checks

# 2. Weights & Biases
%pip install -q wandb
import wandb
wandb.login()

# 3. Reproduce
!python train.py --batch 64 --weights yolov5n.pt --data VOC.yaml --epochs 50 --img 512 --nosave --hyp hyp.finetune.yaml --project yolov5_wandb_reproduce --name yolov5n

Then you can observe the Colab output vs the W&B Log:
https://wandb.ai/glenn-jocher/yolov5_wandb_reproduce/runs/3lvzhcu6/logs?workspace=user-glenn-jocher
[screenshot: 2022-01-06 6:54:57 PM]

@AyushExel
Contributor

@glenn-jocher Thanks. Yes, I was able to repro it and have handed it over, along with the log files, to the CLI engineers.

@AyushExel
Contributor

@glenn-jocher It's strange that it only starts to happen after a few minutes. It doesn't usually happen with COCO128 for, say, 10 epochs.

@glenn-jocher
Member

glenn-jocher commented Jan 10, 2022

@AyushExel yes, it's strange! Actually, I think it might correlate with the number of output lines (or tqdm updates) rather than with time, because when training VOC in Colab with --cache (on high-memory Colab instances) it shows up quickly, in about 15 minutes, around 10 epochs in. If you don't cache, it takes an hour to show up, which also corresponds to about 10 epochs.

Seems like COCO128 with 300 epochs is just too small for it to show up (not enough tqdm updates).
