Starting tensorboard inline within a Jupyter notebook consistently times out #4300

Closed
joyceerhl opened this issue Nov 9, 2020 · 8 comments · Fixed by #4407

Comments

@joyceerhl
Contributor

Consider Stack Overflow for getting support using TensorBoard—they have
a larger community with better searchability:

https://stackoverflow.com/questions/tagged/tensorboard

Do not use this template for setup, installation, or configuration
issues. Instead, use the “installation problem” issue template:

https://github.com/tensorflow/tensorboard/issues/new?template=installation_problem.md

To report a problem with TensorBoard itself, please fill out the
remainder of this template.

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same
environment from which you normally run TensorFlow/TensorBoard, and
paste the output here:

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version 724b56cee52e7d8eb89bbeec1f0d5ce3e38c9682

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=6, releaselevel='final', serial=0)
INFO: os.name: nt
INFO: os.uname(): N/A
INFO: sys.getwindowsversion(): sys.getwindowsversion(major=10, minor=0, build=19042, platform=2, service_pack='')

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.3.0
INFO: installed: tensorflow==2.3.1
INFO: installed: tensorflow-estimator==2.3.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.3.0'

--- check: tensorflow_python_version
2020-11-09 11:08:55.313364: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-11-09 11:08:55.314312: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO: tensorflow.__version__: '2.3.1'
INFO: tensorflow.__git_version__: 'v2.3.0-54-gfcc4b966f1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'C:\\Users\\huer\\AppData\\Local\\Programs\\Python\\Python38\\Scripts\\tensorboard.exe\r\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 1024>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 1024>
Loopback infos: [(<AddressFamily.AF_INET6: 23>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET6: 23>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('::', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 0, '', ('0.0.0.0', 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'Surface-Laptop'

--- check: stat_tensorboardinfo
INFO: directory: C:\Users\huer\AppData\Local\Temp\.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=3940649676184517, st_dev=443364023, st_nlink=1, st_uid=0, st_gid=0, st_size=8192, st_atime=1604948751, st_mtime=1604948698, st_ctime=1603511741)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['C:\\Users\\huer\\AppData\\Local\\Programs\\Python\\Python38\\lib\\site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.11.0
appdirs==1.4.4
argon2-cffi==20.1.0
astunparse==1.6.3
async-generator==1.10
attrs==20.2.0
backcall==0.2.0
black==20.8b1
bleach==3.2.1
cachetools==4.1.1
certifi==2020.6.20
cffi==1.14.3
chardet==3.0.4
click==7.1.2
colorama==0.4.4
debugpy==1.1.0
decorator==4.4.2
defusedxml==0.6.0
entrypoints==0.3
Flask==1.1.2
gast==0.3.3
google-auth==1.23.0
google-auth-oauthlib==0.4.2
google-pasta==0.2.0
grpcio==1.33.2
h5py==2.10.0
idna==2.10
ipykernel==5.3.4
ipython==7.19.0
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.17.2
Jinja2==2.11.2
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==6.1.7
jupyter-console==6.2.0
jupyter-core==4.6.3
jupyterlab-pygments==0.1.2
Keras-Preprocessing==1.1.2
Markdown==3.3.3
MarkupSafe==1.1.1
mistune==0.8.4
mypy-extensions==0.4.3
nbclient==0.5.1
nbconvert==6.0.7
nbformat==5.0.8
nest-asyncio==1.4.2
notebook==6.1.4
numpy==1.18.5
oauthlib==3.1.0
opt-einsum==3.3.0
packaging==20.4
pandas==1.1.3
pandocfilters==1.4.3
parso==0.7.1
pathspec==0.8.0
pickleshare==0.7.5
pip==20.2.1
prometheus-client==0.8.0
prompt-toolkit==3.0.8
protobuf==3.13.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
Pygments==2.7.2
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
pytz==2020.4
pywin32==228
pywinpty==0.5.7
pyzmq==19.0.2
qtconsole==4.7.7
QtPy==1.9.0
regex==2020.9.27
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
selenium==3.141.0
Send2Trash==1.5.0
setuptools==49.2.1
six==1.15.0
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
termcolor==1.1.0
terminado==0.9.1
testpath==0.4.4
toml==0.10.1
tornado==6.1
traitlets==5.0.5
typed-ast==1.4.1
typing-extensions==3.7.4.3
urllib3==1.25.11
wcwidth==0.2.5
webencodings==0.5.1
Werkzeug==1.0.1
wheel==0.35.1
widgetsnbextension==3.5.1
wrapt==1.12.1

For browser-related issues, please additionally specify:

  • Browser type and version (e.g., Chrome 64.0.3282.140):
  • Screenshot, if it’s a visual issue:

Issue description

Please describe the bug as clearly as possible. How can we reproduce the
problem without additional resources (including external data files and
proprietary Python modules)?

Repro steps:

  1. Install Python 3.8.6 64-bit on Windows 10
  2. Run python -m pip install jupyter tensorflow
  3. Run jupyter notebook
  4. Create a Jupyter notebook in the browser
  5. Run %load_ext tensorboard in one cell, then %tensorboard --logdir logs/fit in a second cell
  6. I'd expect to see the tensorboard website appear inline with the message about no active dashboards. Instead, I get a message about timing out waiting for TensorBoard to start:

     [image: screenshot of the timeout message]
  7. If I rerun the cell with %tensorboard --logdir logs/fit, tensorboard does show up inline.

The diagnostic info above is based on my local machine. I initially thought my local box might simply have a busted tensorboard install, but the problem persisted even after uninstalling and reinstalling tensorflow and tensorboard. I was able to repro this on a clean Windows VM with the above repro steps.

Debugging the iopub messages sent by the Jupyter kernel, the Jupyter kernel does send a message with the 'Launching TensorBoard...' message in response to the execution request for %tensorboard --logdir logs/fit the first time, but it doesn't ever send the iframe back. For some reason requesting the tensorboard launch again results in an immediate response about 'Reusing TensorBoard on port <...>', followed by a message with the iframe for display in the cell output.

Happy to provide additional information that will help with diagnosing the issue!

@stephanwlee
Contributor

Hi @joyceerhl, thanks for the report.

I tried reproducing it and failed to do so. It may truly be OS dependent.

Could you try running %time import tensorflow as tf before the %load_ext tensorboard call and report the cell output? Alternatively, can you try installing jupyter and tensorboard (not tensorflow; please uninstall tensorflow in your virtualenv) and reproducing the issue? Sadly, import tensorflow takes a long time and can contribute to the long latency.

Technically speaking, we should never take more than a minute to load TensorFlow or TensorBoard, so this is bad, but the above exercise may shed some light on the underlying issue.

@joyceerhl
Contributor Author

joyceerhl commented Nov 12, 2020

AFAICT this is not an import latency bug because TensorBoard actually starts up almost immediately. I can even access TensorBoard in a browser running on localhost less than 2 seconds after launching it inline in a notebook. The issue seems to be that even after TensorBoard is started, the notebook extension running inside the Jupyter kernel does not send back the iframe that is the result of executing %tensorboard --logdir logs/fit. But if I rerun the cell, the iframe comes back right away. I'm not terribly familiar with how notebook extensions interact with the kernel, but it's almost certainly not an issue with importing tensorflow taking too long.

@rmothukuru

@joyceerhl,
Can you please confirm whether we can close this issue, as the issue is not with importing TensorFlow or TensorBoard? Thanks!

@stephanwlee
Contributor

stephanwlee commented Nov 23, 2020

Sorry for not getting back to you, @joyceerhl. I had some trouble setting up my Windows environment then I forgot about this bug :(

I was able to reproduce your bug and was able to narrow it down a little. When running on Windows, we never get past this:

for info in get_all():
    if info.pid == p.pid and info.start_time >= start_time_seconds:
        return StartLaunched(info=info)

To narrow the problem down, I tried a few things.

# subprogram.py
import os
print("subprogram", os.getpid())

# In TensorBoard's main.py
import os
print("tb main", os.getpid())

# test_main.py
import subprocess
# p = subprocess.Popen(["python", "[PATH_TO]/subprogram.py"])

p = subprocess.Popen(["tensorboard"])
print("main", p.pid)

The weird thing is, when I run my subprogram, the pid reported by subprocess is equal to the one queried within subprogram.py, but the same is not true when running tensorboard in the subprocess. That mismatch ultimately results in infinite polling and an eventual timeout (when we re-execute the cell, the existing program can be found, so we do not spin forever).
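
For anyone who wants to check the pid comparison above without editing TensorBoard, here is a minimal sketch. It assumes the child prints its own pid on stdout; `child_reports_same_pid` is a hypothetical helper, not part of TensorBoard. When the interpreter is launched directly, the two pids would normally agree. A wrapper executable that starts the real interpreter as a separate child process would make the check return False, which is the kind of mismatch described here.

```python
import subprocess
import sys


def child_reports_same_pid(argv):
    """Launch argv, read the pid the child prints, and compare it to Popen.pid.

    Hypothetical helper for illustration only.
    """
    p = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    out, _ = p.communicate()
    reported = int(out.strip())
    return p.pid == reported


# Launch the current interpreter directly and have it print os.getpid().
direct = [sys.executable, "-c", "import os; print(os.getpid())"]
print(child_reports_same_pid(direct))
```

To test a console script instead, the child would need to print its pid itself, as in the "tb main" print above.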

I have not yet tried to reproduce this to see if this is a new Python version regression. I also did not attempt to remedy or find the real source of the bug.

I can reproduce the same issue on Python 3.7.1

@wchargin, would you have any idea how this can happen?

stephanwlee assigned wchargin and unassigned stephanwlee Nov 23, 2020
@wchargin
Contributor

If I understand correctly, you’re saying that when you launch
TensorBoard via subprocess.Popen, the Popen.pid does not match what
the TensorBoard Python code thinks os.getpid() is. Is that right?

If so, I wonder whether there is some kind of wrapper script around
tensorboard on Windows that doesn’t exist on Linux/macOS. That does
sound like the kind of thing that they might do, and it could change
between Python versions.

Maybe one fix would be to relax the info.pid == p.pid check to instead
just find any matching instance with start_time large enough, and
assume that that’s the launched one. If there’s a collision, that’s
fine; you still get a correct TensorBoard, anyway. That should avoid
the pid checking issues.

This would explain why retrying %tensorboard immediately picks it up,
since we call _find_matching_instance before trying to start one.
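
One way the relaxed check suggested above could look, as a rough sketch only: the `Info` record, `poll_for_new_instance`, and its timeout parameters are hypothetical stand-ins, not TensorBoard's actual code. The loop accepts any instance whose start time is recent enough and ignores the pid entirely.

```python
import time
from dataclasses import dataclass


@dataclass
class Info:
    # Hypothetical stand-in for the record a running instance writes about itself.
    pid: int
    start_time: int
    port: int


def poll_for_new_instance(get_all, start_time_seconds,
                          timeout_seconds=5.0, poll_interval_seconds=0.1):
    """Return the first instance started at or after start_time_seconds.

    Deliberately does not compare pids, so a wrapper process in between
    the launcher and the real server cannot hide the instance.
    Returns None if nothing shows up before the timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        for info in get_all():
            if info.start_time >= start_time_seconds:
                return info
        time.sleep(poll_interval_seconds)
    return None


# The launcher saw some other pid, but the instance reports pid 200.
running = [Info(pid=200, start_time=1000, port=6006)]
found = poll_for_new_instance(lambda: running, start_time_seconds=1000,
                              timeout_seconds=1.0)
print(found)
```

The trade-off is the one noted above: two launches at nearly the same time could pick up each other's instance, which is acceptable only if either instance would serve the request correctly.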

@joyceerhl: Unfortunately, I have no Windows box on which to test (Linux
only here). But I can draft a patch and give you a Bazel command to
generate the Pip package. Or, you could just monkey-patch the changes
in, since they’re Python-only.

Let me try to come up with something and send it your way.

wchargin added a commit that referenced this issue Dec 1, 2020
Summary:
When the `%tensorboard` cell magic is invoked, we compute a cache key
for the “hermetic environment”, primarily args to `%tensorboard` and the
working directory. We first check whether any running TensorBoard
instances match that cache key, and launch a new instance if none do.
But then, while polling for the new instance to have launched, we had a
different matching criterion, checking for a process ID match instead of
a cache key match.

The idea was that “is this TensorBoard instance’s PID equal to the PID
of the subprocess that we just spawned?” would be a more reliable check.
But on Windows ((╯°□°)╯︵ ┻━┻) this is not the case, presumably because
the `tensorboard` console script has some kind of wrapper process in
certain versions of Python. This manifested as “`%tensorboard` always
times out on the first invocation, but works immediately when I invoke
it again”, since invoking it again triggers the cache key check rather
than the PID check. So we now just check by cache key in all cases, and
the logic is consistent, if a bit less precise overall.
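
The flow in this summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the commit: `cache_key`, `find_matching`, `start`, and the dict-based instance records are hypothetical names, and the real key may cover more than the arguments and working directory.

```python
import json


def cache_key(arguments, working_directory):
    """Hypothetical key: a stable string built from the args and the cwd."""
    return json.dumps(
        {"arguments": list(arguments), "cwd": working_directory},
        sort_keys=True,
    )


def find_matching(instances, key):
    """Return the first instance record whose cache key equals key, else None."""
    for instance in instances:
        if instance["cache_key"] == key:
            return instance
    return None


def start(arguments, working_directory, list_instances, launch):
    """Reuse a matching instance, or launch one and look it up by the same key."""
    key = cache_key(arguments, working_directory)
    existing = find_matching(list_instances(), key)
    if existing is not None:
        return ("reused", existing)
    launch(arguments, working_directory)
    # A real implementation would poll with a timeout here; one lookup keeps
    # the sketch short. The same predicate is used as for the reuse check.
    launched = find_matching(list_instances(), key)
    if launched is not None:
        return ("launched", launched)
    return ("timed_out", None)


# Fake registry standing in for the info files written by running instances.
registry = []


def fake_launch(arguments, working_directory):
    # The recorded pid is arbitrary: nothing in start() compares pids.
    registry.append({"cache_key": cache_key(arguments, working_directory),
                     "pid": 4242, "port": 6006})


args = ["--logdir", "logs/fit"]
first = start(args, "/tmp", lambda: list(registry), fake_launch)
second = start(args, "/tmp", lambda: list(registry), fake_launch)
print(first[0], second[0])
```

Because both the reuse check and the post-launch lookup go through `find_matching`, the first and second invocations cannot disagree about which instance counts as a match.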

Fixes #4300.

Test Plan:
Still works for me on Linux, with both new and existing TensorBoard
processes across multiple (concurrent) cache keys. @stephanwlee can
repro the bug and fix on Windows with Python 3.8.

wchargin-branch: notebook-poll-no-pid-filter
@wchargin
Contributor

wchargin commented Dec 1, 2020

Hi @joyceerhl—we think we have a fix for this in #4407. It should go out
in today’s tb-nightly, or you can build your own Pip package from head
with bazel build //tensorboard/pip_package, or you can manually patch
the diff into your virtualenv (it’s a small change to one Python file).
If this doesn’t fix it on your end, please let us know and we can reopen
this. Thanks!

@joyceerhl
Contributor Author

Patched the diff in and it works perfectly! Thanks so much for tracking down the fix 😊

@wchargin
Contributor

wchargin commented Dec 1, 2020

Excellent! Thanks for letting us know.
