Kernel nanny proposal #14
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,117 @@ | ||
# Kernel 'nanny' processes | ||
|
||
## Summary | ||
|
||
We propose to start Jupyter kernels through a 'nanny' process, which will always | ||
be running on the same machine as its associated kernel. This offers various | ||
advantages over the current situation, including: | ||
|
||
- Kernels will no longer need to implement the 'heartbeat' for frontends to | ||
check that they are still alive. | ||
- We will be able to interrupt remote kernels (SIGINT cannot be sent over the network).
Review comment: How about a message like shutdown_request?
Review comment: That's a kernel implementation detail. IJavascript doesn't suffer from that problem (suffers from others, though 😛 ).
Review comment: From the perspective of writing a native app, we'll do this directly: sending |
||
- There will be a consistent way to start kernels without a frontend | ||
(`jupyter kernel --kernel x`). | ||
Review comment: Perhaps I'm misunderstanding the proposal. Is
Review comment: This brings me back to what I was thinking when we talked about the kernel nanny, which is that we should consider a general daemon for launching kernels on a system, similar to docker's interface (CLI + API). |
||
- Kernel stdout & stderr can be captured at the OS level, with real-time updates | ||
of output. | ||
Review comment: 👍 |
||
|
||
## The basics | ||
|
||
When a frontend wants to start a kernel, it currently instantiates a `KernelManager` | ||
object which reads the kernelspec to find how to start the kernel, writes a | ||
connection file, and launches the kernel process. With the proposed changes, it will | ||
instead launch the kernel nanny on the machine where the kernel is to run, and | ||
the nanny will be responsible for creating the connection file and launching | ||
the kernel process. | ||
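
As a rough illustration of the nanny's launch step, the sketch below writes a connection file and starts a kernel from a kernelspec-style `argv`. The field names follow the connection files that kernels already read today, and `{connection_file}` is the placeholder kernelspecs already use. The function names are hypothetical and this is not the proposed implementation. In particular, it ignores the open question of which extra sockets the nanny itself would need.

```python
import json
import os
import socket
import subprocess
import sys
import tempfile
import uuid

CHANNELS = ("shell", "iopub", "stdin", "control", "hb")


def pick_free_port(ip="127.0.0.1"):
    """Ask the OS for a TCP port that is unused right now."""
    with socket.socket() as s:
        s.bind((ip, 0))
        return s.getsockname()[1]


def write_connection_file(directory, ip="127.0.0.1"):
    """Write a connection file with the fields existing kernels expect."""
    info = {f"{name}_port": pick_free_port(ip) for name in CHANNELS}
    info.update(
        transport="tcp",
        ip=ip,
        key=uuid.uuid4().hex,
        signature_scheme="hmac-sha256",
    )
    path = os.path.join(directory, f"kernel-{uuid.uuid4().hex}.json")
    with open(path, "w") as f:
        json.dump(info, f, indent=2)
    return path, info


def launch_kernel(kernelspec_argv, connection_file):
    """Substitute the connection file into the kernelspec argv and start it."""
    argv = [arg.replace("{connection_file}", connection_file)
            for arg in kernelspec_argv]
    return subprocess.Popen(argv)
```

A nanny built along these lines would call `write_connection_file()` on the kernel's machine and pass the resulting path to `launch_kernel()` together with the `argv` list read from `kernel.json`.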
|
||
**Rejected alternative:** One kernel nanny process per machine, able to start | ||
multiple kernels. This would be more complex, but we may come back to it later | ||
if the overhead of one nanny per kernel is too much. | ||
|
||
Review comment: Ah, this is down the one nanny per kernel approach now. Glad you cleared this up. |
||
## Socket connections | ||
|
||
Currently, the frontend connects to five sockets to communicate with the kernel: | ||
|
||
* Shell | ||
* Control (priority, used for shutdown) | ||
* Iopub (kernel to frontend only, for output) | ||
* Stdin (used to request input from the frontend) | ||
* Heartbeat | ||
|
||
Of these, shell and stdin will remain connected directly between the kernel and | ||
the frontend. Control and iopub (see output capturing) will be connected through | ||
the nanny, i.e. each channel will have one socket for communications between | ||
the frontend and the nanny, and a second socket for communications between the | ||
nanny and the kernel. (*TODO: What are these called in the connection file? Or | ||
do we have two connection files?*) The heartbeat will only be between the | ||
frontend and the nanny, to detect situations such as network failures. | ||
|
||
## Messaging changes | ||
|
||
* A new message type on the control channel from the frontend to the nanny, | ||
instructing the nanny to shut down the kernel. | ||
Review comment: Why not have the nanny capture https://ipython.org/ipython-doc/3/development/messaging.html#kernel-shutdown |
||
* A new message type on the control channel from the frontend to the nanny, | ||
instructing the nanny to signal/interrupt the kernel. (*TODO: Expose all Unix | ||
signals, or just SIGINT?*) | ||
Review comment: A
Review comment: For Unix systems that certainly makes sense. For Windows, should we just pick some numbers to refer to the available ways we have of interrupting/stopping the kernel process?
Review comment: I think only one or two signals work on Windows reliably, but they are still integers, aren't they?
Review comment: AIUI Windows doesn't really have signals at all, but Python exposes certain similar operations through the same interface it uses for signals on Windows. The description of os.kill has some useful info: https://docs.python.org/3/library/os.html#os.kill We could quite reasonably expose the same set of options with the same meanings as Python does, of course.
Review comment: Given that the nanny process is going to run in the same machine as the kernel, it makes sense that the nanny process is asked to interrupt the kernel by means of a message similar to
Review comment: Right, that's exactly how this will work. We're just trying to work out what form the message will take. If all the world was Unix, we'd almost certainly just call it
Review comment: For the windows problems, see here: jupyter/jupyter_client#104 |
||
* Heartbeat becomes a broadcast signal from the nanny to all connected frontends, | ||
rather than a REP/REQ pattern (which was only the case before because pyzmq | ||
makes it easy to echo messages without grabbing the GIL). | ||
* New broadcast message from nanny to frontends when kernel dies unexpectedly, | ||
including exit status. | ||
* New end of output message, from nanny to frontends? (*TODO: yes/no?*) | ||
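
To make the shutdown and signal messages above more concrete, here is a minimal sketch of how a nanny might act on them once they arrive on the control channel. The message type names `nanny_shutdown_request` and `nanny_signal_request`, the `signum` and `timeout` fields, and the reply shapes are all invented for illustration; the proposal has not settled on any of them. The kernel is assumed to be a `subprocess.Popen` object.

```python
import signal
import subprocess
import sys


def handle_control_message(msg, kernel):
    """Act on a (hypothetical) nanny control message for a kernel Popen."""
    msg_type = msg["header"]["msg_type"]
    content = msg.get("content", {})

    if msg_type == "nanny_signal_request":
        # Assumption: the signal arrives as an integer, defaulting to SIGINT.
        kernel.send_signal(content.get("signum", signal.SIGINT))
        return {"status": "ok"}

    if msg_type == "nanny_shutdown_request":
        # Ask politely first, then force the kernel down if it does not exit.
        kernel.terminate()
        try:
            kernel.wait(timeout=content.get("timeout", 5))
        except subprocess.TimeoutExpired:
            kernel.kill()
            kernel.wait()
        return {"status": "ok", "exit_status": kernel.returncode}

    return {"status": "error", "ename": "UnknownMessage"}
```

On Windows, `Popen.send_signal` accepts only a small set of values (`SIGTERM`, `CTRL_C_EVENT` and `CTRL_BREAK_EVENT`), which is one reason the form of the signal message is still an open question.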
|
||
## Output capturing | ||
|
||
In IPython, we capture stdout/stderr at the Python level (sys.std*). Code which | ||
writes to stdout/stderr at a lower level (e.g. C extensions) will send its output | ||
to the terminal where the frontend was started, instead of to the frontend. | ||
Many other kernels suffer from similar issues. | ||
|
||
We know of tricks using `dup2` to redirect the low-level file handles within the | ||
kernel, but we don't want each kernel to reimplement this, and it is not | ||
possible on Windows. | ||
|
||
To this end, when the kernel nanny starts the kernel, it will be able to create | ||
stdout and stderr as pipes, and turn data read from them into *stream* messages | ||
to be sent to the frontend via the iopub channel. However, this may make | ||
debugging difficult or show unwanted output if kernel authors are using the | ||
terminal to debug the kernel implementation. Therefore, output capturing will | ||
only be enabled if the kernel opts in via its `kernel.json` specification: | ||
|
||
"capture_stdstreams": true | ||
Review comment: A problem that this introduces that we should address is that now kernels cannot write to the terminal at all - there is no way for kernels to have logging without going straight to a file.
Review comment: (duh, I should finish reading, I see
Review comment: other potential problems:
Review comment: Can you expand on this? I don't follow what you mean.
That shouldn't be a problem: the notebook does that by queueing an execution request for each cell. It has to do that for the output from each cell to go to the right place.
Review comment: For instance, when
Review comment: That's precisely the kind of reason that output capturing will be opt-in - kernels need to be ready for it, and then kernel authors can flip the switch to enable it. It won't be enabled for earlier versions of IJavascript, so that won't be a problem. |
||
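
The pipe-based capture described in this section could look roughly like the sketch below: the nanny starts the kernel with stdout and stderr as pipes, and a reader thread per pipe turns each chunk into the content of a *stream* message (`name` and `text`, as in the existing messaging spec). The function names and the use of a queue in place of the iopub socket are assumptions made to keep the example self-contained.

```python
import codecs
import queue
import subprocess
import sys
import threading


def pump(pipe, name, out):
    """Read bytes from a pipe and queue them as stream-message content."""
    # An incremental decoder copes with multi-byte characters split across reads.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with pipe:
        while True:
            chunk = pipe.read1(4096)  # returns as soon as some data is available
            text = decoder.decode(chunk, final=not chunk)
            if text:
                out.put({"name": name, "text": text})
            if not chunk:  # empty read means the kernel closed the pipe
                break


def start_with_capture(argv):
    """Start a process with stdout/stderr as pipes; return it, a queue, and threads."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    messages = queue.Queue()
    threads = [
        threading.Thread(target=pump, args=(proc.stdout, "stdout", messages), daemon=True),
        threading.Thread(target=pump, args=(proc.stderr, "stderr", messages), daemon=True),
    ]
    for t in threads:
        t.start()
    return proc, messages, threads
```

Because the capture happens at the file-descriptor level, output written by C extensions or by `os.write(1, ...)` ends up in the queue just like output written through `sys.stdout`.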
|
||
### Output synchronisation | ||
|
||
With this proposal, there are multiple asynchronous channels for output coming | ||
from the kernel: the `iopub` socket from the kernel to the nanny, and the pipes | ||
carrying stdout and stderr. At present, the `status` message with | ||
`execution_state: idle` marks the end of output on the iopub channel. | ||
|
||
Kernels that opt in to output capturing should print a delimiter (*TODO: define | ||
delimiter*) on each of stdout and stderr, before and after running user code. | ||
The delimiter will include the message ID of the execute_request message. | ||
The nanny will not forward these to the frontend, but will use the 'before' | ||
delimiters to determine which execution each piece of output came from, and the 'after' | ||
delimiters to detect when stream output is finished. The nanny will tell the | ||
frontend when all output from an execute_request, on iopub and the two pipes, | ||
is complete (*TODO: using status:idle, or a new message?*). | ||
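
Since the delimiter is still undefined, the following is only a sketch of the nanny-side bookkeeping under an invented format: `\x1eBEGIN <msg_id>\x1e` before user code and `\x1eEND <msg_id>\x1e` after it. Both the format and the function name are hypothetical. For simplicity the function works on text that has already been read in full from one pipe, rather than handling delimiters split across reads.

```python
import re

# Hypothetical delimiter format; the proposal leaves the real one as a TODO.
DELIM = re.compile(r"\x1e(BEGIN|END) ([^\x1e\n]+)\x1e\n?")


def split_by_execution(captured):
    """Attribute captured text to execute_request message IDs.

    Returns (segments, finished): segments is a list of (msg_id or None, text)
    with the delimiters stripped out, and finished lists the msg_ids whose
    'after' delimiter has been seen.
    """
    segments = []
    finished = []
    current = None  # msg_id of the execution currently producing output
    pos = 0
    for match in DELIM.finditer(captured):
        text = captured[pos:match.start()]
        if text:
            segments.append((current, text))
        kind, msg_id = match.group(1), match.group(2)
        if kind == "BEGIN":
            current = msg_id
        else:
            finished.append(msg_id)
            current = None
        pos = match.end()
    if captured[pos:]:
        segments.append((current, captured[pos:]))
    return segments, finished
```

Once a message ID appears in `finished` for both pipes and the kernel's idle status has arrived on iopub, the nanny would know that all output for that request is complete.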
|
||
Review comment: This is assuming synchronous execution.
Review comment: It is, but that's already generally our assumption, and I don't see any other way to do it. There is no side channel in sync with stdout/stderr by which we can convey metadata like the parent message ID, so it has to be in-band. This is all in addition to the existing mechanisms for kernels to send output, so kernels for which async is really important should focus on capturing output in process and sending the correct messages.
Review comment: I would let the kernel handle that. If the kernel can identify the source of a stream message, let the kernel send the appropriate iopub reply.
Review comment: Yes, if the kernel is sending the stream message itself, it should absolutely do that. This is for the case where the data is going over the stdout/stderr pipes. If you can capture all stdout/stderr well enough within IJavascript, there's no need for you to enable output capturing. |
||
**Rejected alternative:** Frontends monitoring when output is complete on each | ||
channel. The frontend output handling logic would have to know whether the | ||
kernel in use used output capturing or not, and the logic would have to be | ||
written for each frontend. With the scheme we decided upon, the logic need only | ||
be implemented once in the nanny process, and the frontend can remain ignorant | ||
of whether the kernel has enabled output capturing. | ||
|
||
### Kernel logging | ||
|
||
Where kernels have previously been using low-level stdout/stderr to log to the | ||
terminal, they need a new way to produce diagnostic logs which shouldn't be | ||
displayed in the frontend. Kernels opting in to output capturing will be | ||
started with an environment variable `JUPYTER_KERNEL_LOG` set. The kernel | ||
should treat this as a filesystem path, which it can open and write logs to. | ||
|
||
Review comment: Suggestion: add a new type of message for kernels to send their log messages to the frontend.
Review comment: Thanks for the suggestion, but this logging is what you're going to be using if your kernel's messages are not getting to the frontend for whatever reason. So I think it really needs to be a) a separate channel, and b) as technically simple as possible.
Review comment: Can the kernel log output be disabled? If yes, how is that indicated? Variable not set, or set to an empty value? |
||
Kernels should make minimal assumptions about the type of file they are opening. | ||
It may be a regular file, a FIFO (or named pipe), or the slave end of the tty | ||
where the frontend was started, on systems where that is possible. This will | ||
likely depend on configuration settings in the frontend, and possibly on how the | ||
frontend is started. It should never be a directory, however. |
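
On the kernel side, using `JUPYTER_KERNEL_LOG` could be as small as the sketch below. It opens the path append-only and line-buffered, and never seeks or truncates, so it should behave the same whether the target is a regular file, a FIFO, or a tty. Treating an unset or empty variable as "no log requested" is an assumption of this example; the proposal does not yet say how logging is disabled. The helper names are hypothetical.

```python
import os


def open_kernel_log():
    """Open the path in JUPYTER_KERNEL_LOG for appending, or return None."""
    path = os.environ.get("JUPYTER_KERNEL_LOG")
    if not path:
        # Assumption: unset or empty means no log was requested.
        return None
    # Append mode with line buffering: each log line is flushed as written,
    # and nothing here depends on the file being seekable.
    return open(path, "a", encoding="utf-8", buffering=1)


def log(stream, message):
    """Write one diagnostic line, doing nothing if logging is disabled."""
    if stream is not None:
        stream.write(message.rstrip("\n") + "\n")
```

A kernel would call `open_kernel_log()` once at startup and pass the result to `log()` wherever it previously printed diagnostics to the terminal.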
Review comment: How would the nanny process check the kernel is alive?
Review comment: e.g. `subprocess.Popen.poll()`, but depending on how it's written, there may well be smarter ways. On Unix, the parent process is sent `SIGCHLD` when one of its children dies.
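
A minimal polling version of that liveness check, assuming the nanny holds the kernel as a `subprocess.Popen` object, might look like this. The `kernel_died` message type and its `exit_status` field are placeholders for the broadcast message proposed under "Messaging changes", not a defined format.

```python
import subprocess
import sys
import time


def wait_for_death(kernel, interval=0.1, timeout=None):
    """Poll a kernel Popen until it exits; return a (hypothetical) death notice.

    Returns None if the timeout expires while the kernel is still alive.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while kernel.poll() is None:
        if deadline is not None and time.monotonic() > deadline:
            return None
        time.sleep(interval)
    return {"msg_type": "kernel_died",
            "content": {"exit_status": kernel.returncode}}
```

A real nanny would more likely fold this check into its event loop, or react to `SIGCHLD` on Unix, rather than block in a sleep loop.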