Starting tensorboard inline within a Jupyter notebook consistently times out #4300

joyceerhl · 2020-11-09T20:42:07Z

Environment information (required)

Diagnostics

Diagnostics output

--- check: autoidentify
INFO: diagnose_tensorboard.py version 724b56cee52e7d8eb89bbeec1f0d5ce3e38c9682

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=6, releaselevel='final', serial=0)
INFO: os.name: nt
INFO: os.uname(): N/A
INFO: sys.getwindowsversion(): sys.getwindowsversion(major=10, minor=0, build=19042, platform=2, service_pack='')

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.3.0
INFO: installed: tensorflow==2.3.1
INFO: installed: tensorflow-estimator==2.3.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.3.0'

--- check: tensorflow_python_version
2020-11-09 11:08:55.313364: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-11-09 11:08:55.314312: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO: tensorflow.__version__: '2.3.1'
INFO: tensorflow.__git_version__: 'v2.3.0-54-gfcc4b966f1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'C:\\Users\\huer\\AppData\\Local\\Programs\\Python\\Python38\\Scripts\\tensorboard.exe\r\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 1024>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 1024>
Loopback infos: [(<AddressFamily.AF_INET6: 23>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET6: 23>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('::', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('0.0.0.0', 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'Surface-Laptop'

--- check: stat_tensorboardinfo
INFO: directory: C:\Users\huer\AppData\Local\Temp\.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=3940649676184517, st_dev=443364023, st_nlink=1, st_uid=0, st_gid=0, st_size=8192, st_atime=1604948751, st_mtime=1604948698, st_ctime=1603511741)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['C:\\Users\\huer\\AppData\\Local\\Programs\\Python\\Python38\\lib\\site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.11.0
appdirs==1.4.4
argon2-cffi==20.1.0
astunparse==1.6.3
async-generator==1.10
attrs==20.2.0
backcall==0.2.0
black==20.8b1
bleach==3.2.1
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.3
chardet==3.0.4
click==7.1.2
colorama==0.4.4
debugpy==1.1.0
decorator==4.4.2
defusedxml==0.6.0
entrypoints==0.3
Flask==1.1.2
gast==0.3.3
google-auth==1.23.0
google-auth-oauthlib==0.4.2
google-pasta==0.2.0
grpcio==1.33.2
h5py==2.10.0
idna==2.10
ipykernel==5.3.4
ipython==7.19.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.17.2
Jinja2==2.11.2
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.7
jupyter-console==6.2.0
jupyter-core==4.6.3
jupyterlab-pygments==0.1.2
Keras-Preprocessing==1.1.2
Markdown==3.3.3
MarkupSafe==1.1.1
mistune==0.8.4
mypy-extensions==0.4.3
nbclient==0.5.1
nbconvert==6.0.7
nbformat==5.0.8
nest-asyncio==1.4.2
notebook==6.1.4
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.4
pandas==1.1.3
pandocfilters==1.4.3
parso==0.7.1
pathspec==0.8.0
pickleshare==0.7.5
pip==20.2.1
prometheus-client==0.8.0
prompt-toolkit==3.0.8
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
Pygments==2.7.2
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
pytz==2020.4
pywin32==228
pywinpty==0.5.7
pyzmq==19.0.2
qtconsole==4.7.7
QtPy==1.9.0
regex==2020.9.27
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
selenium==3.141.0
Send2Trash==1.5.0
setuptools==49.2.1
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
termcolor==1.1.0
terminado==0.9.1
testpath==0.4.4
toml==0.10.1
tornado==6.1
traitlets==5.0.5
typed-ast==1.4.1
typing-extensions==3.7.4.3
urllib3==1.25.11
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
wheel==0.35.1
widgetsnbextension==3.5.1
wrapt==1.12.1

Issue description

Repro steps:

1. Install Python 3.8.6 64-bit on Windows 10.
2. Run python -m pip install jupyter tensorflow
3. Run jupyter notebook
4. Create a Jupyter notebook in the browser.
5. Run %load_ext tensorboard in one cell, then %tensorboard --logdir logs/fit in a second cell.
6. I'd expect to see the tensorboard website appear inline with the message about no active dashboards. Instead, I get a message about timing out waiting for TensorBoard to start.
7. If I rerun the cell with %tensorboard --logdir logs/fit, tensorboard does show up inline.

The diagnostic info above is based on my local machine. I initially thought my local box might simply have a busted tensorboard install, but the problem persisted even after uninstalling and reinstalling tensorflow and tensorboard. I was able to repro this on a clean Windows VM with the above repro steps.

Debugging the iopub messages sent by the Jupyter kernel, I can see that the kernel does send a 'Launching TensorBoard...' message in response to the first execution request for %tensorboard --logdir logs/fit, but it never sends the iframe back. For some reason, requesting the tensorboard launch again results in an immediate response about 'Reusing TensorBoard on port <...>', followed by a message with the iframe for display in the cell output.

Happy to provide additional information that will help with diagnosing the issue!

stephanwlee · 2020-11-12T18:42:12Z

Hi @joyceerhl, thanks for the report.

I tried reproducing it and failed to do so. It may truly be OS dependent.

Could you try running %time import tensorflow as tf before the %load_ext tensorboard call and report the cell output? Alternatively, can you try installing jupyter and tensorboard (not tensorflow; please uninstall tensorflow in your virtualenv) and reproducing the issue? Sadly, import tensorflow takes a long time and can contribute to the long latency.

Technically speaking, we should never take more than a minute to load TensorFlow or TensorBoard, so this is bad, but the above exercise may shed some light on the underlying issue.

joyceerhl · 2020-11-12T21:01:21Z

AFAICT this is not an import latency bug because TensorBoard actually starts up almost immediately. I can even access TensorBoard in a browser running on localhost less than 2 seconds after launching it inline in a notebook. The issue seems to be that even after TensorBoard is started, the notebook extension running inside the Jupyter kernel does not send back the iframe that is the result of executing %tensorboard --logdir logs/fit. But if I rerun the cell, the iframe comes back right away. I'm not terribly familiar with how notebook extensions interact with the kernel, but it's almost certainly not an issue with importing tensorflow taking too long.

rmothukuru · 2020-11-21T17:01:34Z

@joyceerhl,
Can you please confirm whether we can close this issue, since the issue is not with importing TensorFlow or TensorBoard? Thanks!

stephanwlee · 2020-11-23T05:21:25Z

Sorry for not getting back to you, @joyceerhl. I had some trouble setting up my Windows environment then I forgot about this bug :(

I was able to reproduce your bug and was able to narrow it down a little. When running on Windows, we never get past this:

tensorboard/tensorboard/manager.py, lines 433 to 435 in a46a6f6:

    for info in get_all():
        if info.pid == p.pid and info.start_time >= start_time_seconds:
            return StartLaunched(info=info)

In order to narrow the problem down, I tried a few things.

# subprogram.py
import os
print("subprogram", os.getpid())

# In TensorBoard's main.py
import os
print("tb main", os.getpid())

# test_main.py
import subprocess

# Variant 1: launch the trivial subprogram.
# p = subprocess.Popen(["python", "[PATH_TO]/subprogram.py"])

# Variant 2: launch the tensorboard console script.
p = subprocess.Popen(["tensorboard"])
print("main", p.pid)

The weird thing is that when I run my subprogram, the pid reported by subprocess equals the one queried within subprogram.py, but the same is not true when running tensorboard in the subprocess. The PID check in the polling loop therefore never succeeds, and we keep polling until the eventual timeout. (When we re-execute the cell, the existing program can be found, so we do not spin forever.)
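
The experiment above can be condensed into one self-contained script. This is only a minimal sketch, not the code that was run in this thread: the helper name `compare_pids` is hypothetical, and by default it launches the current interpreter directly (via `sys.executable`) instead of the `tensorboard` console script. If the child is started through a wrapper or launcher executable, the two values may differ, which is the situation described above.

```python
import subprocess
import sys


def compare_pids(argv=None):
    """Spawn a child and return (Popen.pid, the PID the child reports for itself)."""
    if argv is None:
        # Assumption: run the interpreter directly, with no console-script wrapper.
        argv = [sys.executable, "-c", "import os; print(os.getpid())"]
    p = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    out, _ = p.communicate()
    return p.pid, int(out.strip())


if __name__ == "__main__":
    parent_view, child_view = compare_pids()
    print("Popen.pid:", parent_view)
    print("child os.getpid():", child_view)
    print("match:", parent_view == child_view)
```

Passing a different `argv` (any command that prints its own PID on stdout) lets you compare a wrapped launch against a direct one.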

~~I have not yet tried to reproduce this to see if this is a new Python version regression~~. I also did not attempt to remedy or find the real source of the bug.

I can reproduce the same issue on Python 3.7.1

@wchargin, would you have any idea how this can happen?

wchargin · 2020-11-24T22:00:09Z

If I understand correctly, you’re saying that when you launch
TensorBoard via subprocess.Popen, the Popen.pid does not match what
the TensorBoard Python code thinks os.getpid() is. Is that right?

If so, I wonder whether there is some kind of wrapper script around
tensorboard on Windows that doesn’t exist on Linux/macOS. That does
sound like the kind of thing that they might do, and it could change
between Python versions.

Maybe one fix would be to relax the info.pid == p.pid check to instead
just find any matching instance with start_time large enough, and
assume that that’s the launched one. If there’s a collision, that’s
fine; you still get a correct TensorBoard, anyway. That should avoid
the pid checking issues.
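
A rough sketch of what such a relaxed poll could look like, assuming nothing about the real manager.py beyond the three lines quoted earlier in the thread. `InstanceInfo`, `poll_for_launch`, and the timeout parameters are hypothetical stand-ins, and `get_all` is passed in as a plain callable so the sketch stays self-contained.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class InstanceInfo:
    """Hypothetical stand-in for the info record a running instance publishes."""
    pid: int
    start_time: int  # seconds since epoch
    port: int


def poll_for_launch(
    get_all: Callable[[], Iterable[InstanceInfo]],
    start_time_seconds: int,
    timeout_seconds: float = 5.0,
    poll_interval_seconds: float = 0.1,
) -> Optional[InstanceInfo]:
    """Return the first instance started at or after start_time_seconds.

    No PID comparison is made, so a wrapper process sitting between the
    launcher and the server does not affect the result. Returns None if
    no such instance appears before the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        for info in get_all():
            if info.start_time >= start_time_seconds:
                return info
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_seconds)
```

The trade-off is the one noted above: if two instances start at around the same time, this may return the other one, but the caller still ends up with a working TensorBoard.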

This would explain why retrying %tensorboard immediately picks it up,
since we call _find_matching_instance before trying to start one.

@joyceerhl: Unfortunately, I have no Windows box on which to test (Linux
only here). But I can draft a patch and give you a Bazel command to
generate the Pip package. Or, you could just monkey-patch the changes
in, since they’re Python-only.

Let me try to come up with something and send it your way.

@stephanwlee

Summary: When the `%tensorboard` cell magic is invoked, we compute a cache key for the “hermetic environment”, primarily args to `%tensorboard` and the working directory. We first check whether any running TensorBoard instances match that cache key, and launch a new instance if none do. But then, while polling for the new instance to have launched, we had a different matching criterion, checking for a process ID match instead of a cache key match.

The idea was that “is this TensorBoard instance’s PID equal to the PID of the subprocess that we just spawned?” would be a more reliable check. But on Windows ((╯°□°）╯︵ ┻━┻) this is not the case, presumably because the `tensorboard` console script has some kind of wrapper process in certain versions of Python. This manifested as “`%tensorboard` always times out on the first invocation, but works immediately when I invoke it again”, since invoking it again triggers the cache key check rather than the PID check.

So we now just check by cache key in all cases, and the logic is consistent, if a bit less precise overall.

Fixes #4300.

Test Plan: Still works for me on Linux, with both new and existing TensorBoard processes across multiple (concurrent) cache keys. @stephanwlee can repro the bug and fix on Windows with Python 3.8.

wchargin-branch: notebook-poll-no-pid-filter
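
To illustrate the idea of matching by cache key in both places, here is a small sketch. It is not the code from the fix: `make_cache_key`, `RunningInstance`, and `find_by_cache_key` are hypothetical names, and the key format (a canonical JSON string of the working directory and the arguments) is an assumption chosen only to keep the example short.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


def make_cache_key(working_directory: str, arguments: Sequence[str]) -> str:
    """Hypothetical cache key: a canonical encoding of the cwd and the args."""
    payload = {"working_directory": working_directory, "arguments": list(arguments)}
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


@dataclass
class RunningInstance:
    """Hypothetical record describing one running instance."""
    pid: int
    port: int
    cache_key: str


def find_by_cache_key(
    instances: Iterable[RunningInstance], key: str
) -> Optional[RunningInstance]:
    """Return the first instance whose cache key matches; PIDs are ignored."""
    for instance in instances:
        if instance.cache_key == key:
            return instance
    return None
```

Using the same lookup both for "is a matching instance already running?" and for "has the instance we just launched come up yet?" keeps the two checks consistent, at the cost of no longer distinguishing which process actually served the request.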

wchargin · 2020-12-01T18:48:46Z

Hi @joyceerhl—we think we have a fix for this in #4407. It should go out
in today’s tb-nightly, or you can build your own Pip package from head
with bazel build //tensorboard/pip_package, or you can manually patch
the diff into your virtualenv (it’s a small change to one Python file).
If this doesn’t fix it on your end, please let us know and we can reopen
this. Thanks!

joyceerhl · 2020-12-01T19:12:19Z

Patched the diff in and it works perfectly! Thanks so much for tracking down the fix 😊

wchargin · 2020-12-01T19:54:00Z

Excellent! Thanks for letting us know.

rmothukuru self-assigned this Nov 10, 2020

rmothukuru added core:frontend os:windows type:bug stat:awaiting tensorflower labels Nov 10, 2020

rmothukuru assigned stephanwlee and unassigned rmothukuru Nov 10, 2020

stephanwlee removed the core:frontend label Nov 12, 2020

stephanwlee assigned wchargin and unassigned stephanwlee Nov 23, 2020

wchargin mentioned this issue Dec 1, 2020

notebook: don’t filter polled instances by PID #4407

Merged

wchargin closed this as completed in #4407 Dec 1, 2020

wchargin mentioned this issue Dec 2, 2020

"ERROR: Timed out waiting for TensorBoard to start." on Jupyter #2739

Closed
