Regression affecting IJulia #47196

Closed
zickgraf opened this issue Oct 17, 2022 · 13 comments
Labels
multithreading (Base.Threads and related functionality), regression (Regression in behavior compared to a previous version)

@zickgraf
Contributor

zickgraf commented Oct 17, 2022

Summary: b7201d6 introduces a regression in IJulia. When trying to start a Julia kernel in a Jupyter notebook, Julia segfaults.

  1. The output of versioninfo()
Julia Version 1.9.0-DEV.1602
Commit 69f8a7b6481 (2022-10-17 08:33 UTC) (note: bisected to the commit mentioned above)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores
  2. How you installed Julia
    Both via compiling from source and via downloading a nightly build.
  3. A minimal working example (MWE), also known as a minimum reproducible example
    Install IJulia, run notebook() and create a new notebook with kernel "Julia 1.9.0-DEV". After a timeout of about 1 minute, an error message "Connection failed" appears. Running everything with strace shows that the child Julia process launched by IJulia has segfaulted.
@giordano giordano added regression Regression in behavior compared to a previous version multithreading Base.Threads and related functionality labels Oct 17, 2022
@matrixbot123

Can you share the error message that was shown when you launched the notebook?

@zickgraf
Contributor Author

The full error message when creating a new notebook with kernel "Julia 1.9.0-DEV" reads:

Connection failed

A connection to the notebook server could not be established. The notebook will continue trying to reconnect. Check your network connection or notebook server configuration.

When launching jupyter directly via jupyter-notebook in a terminal, I get the following output:

[I 11:14:13.507 NotebookApp] Authentication of /metrics is OFF, since other authentication is disabled.
[W 11:14:13.679 NotebookApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
[W 2022-10-18 11:14:14.038 LabApp] 'notebook_dir' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-10-18 11:14:14.038 LabApp] 'use_redirect_file' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-10-18 11:14:14.038 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-10-18 11:14:14.038 LabApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2022-10-18 11:14:14.038 LabApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2022-10-18 11:14:14.045 LabApp] JupyterLab extension loaded from /usr/lib/python3.10/site-packages/jupyterlab
[I 2022-10-18 11:14:14.045 LabApp] JupyterLab application directory is /usr/share/jupyter/lab
[I 11:14:14.049 NotebookApp] Serving notebooks from local directory: ****
[I 11:14:14.049 NotebookApp] Jupyter Notebook 6.4.12 is running at:
[I 11:14:14.049 NotebookApp] http://localhost:8888/
[I 11:14:14.049 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[I 11:14:18.875 NotebookApp] Creating new notebook in 
[I 11:14:20.042 NotebookApp] Kernel started: 48e5b1e1-f88a-4f95-858c-072ed03a199e, name: julia-1.9
[W 11:14:20.086 NotebookApp] 404 GET /nbextensions/widgets/notebook/js/extension.js?v=20221018111413 (127.0.0.1) 5.640000ms referer=http://localhost:8888/notebooks/Untitled.ipynb?kernel_name=julia-1.9
Starting kernel event loops.
[W 11:14:42.048 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:15:04.061 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:15:20.066 NotebookApp] Timeout waiting for kernel_info reply from 48e5b1e1-f88a-4f95-858c-072ed03a199e
[I 11:15:20.570 NotebookApp] Starting buffering for 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[I 11:15:20.571 NotebookApp] Restoring connection for 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:15:25.084 NotebookApp] Nudge: attempt 10 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:15:30.095 NotebookApp] Nudge: attempt 20 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:15:35.106 NotebookApp] Nudge: attempt 30 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:15:40.116 NotebookApp] Nudge: attempt 40 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:15:45.127 NotebookApp] Nudge: attempt 50 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:15:50.137 NotebookApp] Nudge: attempt 60 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:15:55.147 NotebookApp] Nudge: attempt 70 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:16:00.158 NotebookApp] Nudge: attempt 80 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:16:05.170 NotebookApp] Nudge: attempt 90 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:16:10.181 NotebookApp] Nudge: attempt 100 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:16:15.192 NotebookApp] Nudge: attempt 110 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[W 11:16:20.203 NotebookApp] Nudge: attempt 120 on kernel 48e5b1e1-f88a-4f95-858c-072ed03a199e
[E 11:16:20.573 NotebookApp] Uncaught exception GET /api/kernels/48e5b1e1-f88a-4f95-858c-072ed03a199e/channels?session_id=**** (127.0.0.1)
    HTTPServerRequest(protocol='http', host='localhost:8888', method='GET', uri='/api/kernels/48e5b1e1-f88a-4f95-858c-072ed03a199e/channels?session_id=****', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/usr/lib/python3.10/site-packages/tornado/websocket.py", line 944, in _accept_connection
        await open_result
      File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
        future.result()
    asyncio.exceptions.TimeoutError: Timeout
[W 11:16:22.054 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:16:44.066 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:17:08.069 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:17:36.072 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:18:13.072 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****
[W 11:19:05.074 NotebookApp] Replacing stale connection: 48e5b1e1-f88a-4f95-858c-072ed03a199e:****

@KristofferC KristofferC added this to the 1.9 milestone Nov 20, 2022
@vtjnash
Member

vtjnash commented Nov 22, 2022

> Running everything with strace shows that the child Julia process launched by IJulia has segfaulted.

You are likely confusing a regular signal sent to start garbage collection with a failure. I see only the former and no actual failures.

I tracked the timeout back to https://github.com/JuliaLang/IJulia.jl/blob/cc2a9bf61a2515596b177339f9a3514de8c38573/src/heartbeat.jl, which now acquires the gc-lock, but then exits Julia code, and thus never gets an opportunity to release it.

@zickgraf
Contributor Author

Indeed, with the latest nightly (Version 1.10.0-DEV.41 (2022-11-23), Commit 3200219) the behavior seems to have changed compared to my previous tests: Instead of crashing, Julia now runs at 100% CPU.

@fp4code
Contributor

fp4code commented Dec 13, 2022

> Can you share the error message that was shown when you launched the notebook?

Confirmed!
Before regression, commit 981f3d2 2022-10-14:

...
[D 18:04:02.890 NotebookApp] Initializing websocket connection /api/kernels/73886252-3593-4b3f-afb8-e3869ca5105d/channels
[D 18:04:02.891 NotebookApp] Requesting kernel info from 73886252-3593-4b3f-afb8-e3869ca5105d
[D 18:04:02.891 NotebookApp] Connecting to: tcp://127.0.0.1:58655
Starting kernel event loops.
[D 18:04:06.866 NotebookApp] 200 GET /api/contents/Untitled12.ipynb?content=0&_=1670950925872 (127.0.0.1) 1.180000ms
...

After regression, commit c63c1e4 2022-10-14:

...
[D 18:02:06.136 NotebookApp] Initializing websocket connection /api/kernels/f0c4d2df-5ae5-4328-bcb4-e338495249df/channels
[D 18:02:06.137 NotebookApp] Requesting kernel info from f0c4d2df-5ae5-4328-bcb4-e338495249df
[D 18:02:06.137 NotebookApp] Connecting to: tcp://127.0.0.1:56469
[D 18:02:09.453 NotebookApp] activity on f0c4d2df-5ae5-4328-bcb4-e338495249df: status (starting)
Starting kernel event loops.
[W 18:03:06.137 NotebookApp] Timeout waiting for kernel_info reply from f0c4d2df-5ae5-4328-bcb4-e338495249df
...

@vtjnash
Member

vtjnash commented Dec 13, 2022

Yes, this is an IJulia bug, where it enters a spin loop because of the foreign heartbeat_thread.

@zickgraf
Contributor Author

Cross-reference: I have reported this in the IJulia issue tracker in JuliaLang/IJulia.jl#1062.

@stevengj
Member

stevengj commented Dec 16, 2022

> Yes, this is an IJulia bug, where it enters a spin loop because of the foreign heartbeat_thread.

Could you elaborate? IJulia's heartbeat code calls uv_thread_create to spawn a new libuv thread, but the new thread only calls zmq_proxy, which never returns. Where is the spin loop?

Oh, I see your comment above about not releasing the gc lock. How do we release this in order to call a C function that never returns?

@vtjnash
Member

vtjnash commented Dec 16, 2022

Never returning from C code is disallowed, since the GC will wait forever to be able to access it or spawn work there.

@vtjnash
Member

vtjnash commented Dec 16, 2022

@maleadt was talking about adding that feature a couple months ago, but it is not implemented now. You can try enabling a GC safe region before the ccall though, and it probably won't crash.

@stevengj
Member

I agree that this is a bit of an odd case, but the GC doesn't need to be involved in the heartbeat thread at all here. How do I enable a GC safe region?

@maleadt
Member

maleadt commented Dec 20, 2022

IIUC:

state = ccall(:jl_gc_safe_enter, Int8, ())
# work
ccall(:jl_gc_safe_leave, Cvoid, (Int8,), state)

@stevengj
Member

Thanks, that seems to work. Closed by JuliaLang/IJulia.jl@30ff84b
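
For anyone hitting the same hang from another package, the snippet above can be fleshed out into a minimal sketch. It assumes a POSIX system: `usleep` from libc stands in for a long-blocking C function such as `zmq_proxy`, and the helper name `blocking_c_call` is hypothetical. `jl_gc_safe_enter` and `jl_gc_safe_leave` are internal runtime entry points taken from the comment above, not documented public API, so their availability may differ between Julia versions.

```julia
# Hypothetical helper: block inside C for `us` microseconds while letting the
# GC proceed on other threads. `usleep` is a stand-in for any long-running C call.
function blocking_c_call(us::Integer)
    # Convert the argument up front so no Julia conversion code runs
    # inside the GC-safe region.
    us_c = Cuint(us)
    # Mark this thread as GC-safe; the return value is the previous GC state.
    state = ccall(:jl_gc_safe_enter, Int8, ())
    # Only plain C runs here: no Julia allocation or access to Julia objects.
    rc = ccall(:usleep, Cint, (Cuint,), us_c)
    # Restore the previous GC state before returning to regular Julia code.
    ccall(:jl_gc_safe_leave, Cvoid, (Int8,), state)
    return rc
end

blocking_c_call(1_000)  # blocks for roughly 1 ms
```

In the heartbeat case discussed above the C function never returns, so the matching leave call would presumably never be reached; the sketch shows the general enter/leave pairing for calls that do return.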

8 participants