Remove blocking Kernel Startup #86

aazhou1 · 2017-07-21T22:17:20Z

Single-threaded kernel startup causing notebook front end to lose connection to gateway as a lot of kernels are being started--so that notebook has to re-pull kernel specs from gateway after Elyra finishes starting kernels.

kevin-bates · 2017-07-26T16:52:51Z

I believe this issue is a result of the duration it takes to "start" a kernel relative to a remote cluster. In normal circumstances (local kernels) the duration is on the order of a second (to invoke the process and have the kernel respond). For remote kernels running as yarn-cluster, that duration is an order of magnitude longer (10-15 seconds) which exacerbates as blocking startups on the front end.

I tried to look into this early on but couldn't make any headway because the Jupyter framework assumes the kernel is communicating immediately on return from the startup logic (so using a thread doesn't really help).

kevin-bates · 2017-08-28T20:58:58Z

I haven't been able to make any headway into this. The MappingKernelManager is async in that it uses the @gen.coroutine decorator. This "multi kernel manager" instantiates a "single kernel manager" - which is responsible for starting the kernel - that, when remote - takes several seconds.

I tried introducing asynchronous behaviors at various levels within this framework - either at the point of launch, higher up at the start of the "single kernel manager", or even higher within the "multi-kernel manager" sequence - as well as replacing time.sleep with yield gen.sleep. In all cases, any form of yield returns to the caller - as if the call did not occur. This causes control to be returned back to the user too soon.

In addition, I suspect the current asynchronous intentions are not working as expected, but that has gone unnoticed because regular kernel startups are on the order of one, maybe two, seconds.

Here are some links I found that seemed applicable to what we need:
http://blog.trukhanov.net/Running-synchronous-code-on-tornado-asynchronously/
https://stackoverflow.com/questions/18088176/calling-tornado-coroutines-from-synchronous-code
https://stackoverflow.com/questions/13051591/tornado-blocking-asynchronous-requests

Placing back on the backlog for now. It would be great if someone else could spend some time with this. I'd be happy to share more about what I tried.

Carreau · 2018-11-12T00:03:04Z

As promised in another thread, some comments on this; Apologies for the delay, I've been pretty busy on other things (like deleting root's home on one of the cluster I manage because i'm stupid...)

I believe there is a path forward that keep backward compatibility but will allow us to slowly migrate to async-almost-everything and not block kernel launching. I'm going to use python 3 syntax, but it should be possible to do that on Python 2 with @gen.coroutine/yield but it's less clear in explanations.

We need slowly migrate all the current API from def function() to async def function(), and prefer the second one. The nice thing is that async def function() can be blocking, that is not a problem, but it's hard to make def function() non-blocking.

For client that do not run in an event-loop, and are "quasi sync", you can fake an event loop that just advance everything like in there

We need to provide backward compatibility, that is to say we need to provide, and be able to call call 2 kinds of methods:

Synchone def foo()
Asynchrone async def foo().

In python 3 at least, you can figure out whether a function is synchrone using inspect.iscoroutinefunction. If it's sync then we just call it, if it's async we yield from it.

I suggest the following for every method def XXX(...) that participate in kernel management we define:

Make a async def XXX_async() version that is provisional but async.

For every method foo we currently call check anc call in order:

async def foo
async def foo_async
def foo

Everytime we encounter a non-async function, or a non-async function get called have extensive deprecation warning that ask users to:

make their function async if it's sync.
Provide them with a simple way of upgrading their code to call both async and non-async function.

I've been toying with that in jupyter/jupyter_client#402, and jupyter/notebook#4126, I just need to find the right abstraction and pattern to allow the transition to be as seamless as possible for users.

As far as I can tell, we should be able to do a transition where all of the new API are compatible with the future ones with minimal disruptions and code change. I believe with the right tooling, documentation and example that should be something relatively systematic to do, so the would just be a question of time and outreach to review all the possible places that need to be updated.

Apologies if this is a bit unclear, I'm still trying to arrange things in my head, and I still need to dive into kernel and enterprise gateway to see how it affects or would simplify those.

kevin-bates · 2018-11-12T17:48:35Z

@Carreau - thank you for your comment. Yes, I've been monitoring your WIP PRs and agree that its best to see how the dust settles on those. In addition, @takluyver's WIP PR may essentially eliminate all kernel management from Enterprise Gateway anyway - so we need to wait to see where his effort takes us. I would recommend you coordinate with Thomas, although I suspect you have already done that. Looks like Thomas is accounting for async kernel providers also - all good.

In the meantime, we will continue to monitor these kernel management PRs and contribute where we can. Our primary interest is also in preserving the existing functionality we provide and to minimize any migration challenges that lie ahead when adopting these new architectures.

takluyver · 2018-11-13T17:19:17Z

Looks like Thomas is accounting for async kernel providers also - all good.

To be honest, not really. I tried to figure this out, and there are launch_async methods in the API, but it all got too messy, and I've wound up using only the blocking APIs to launch and manage kernels. I have a working async client built on the tornado interface, but not the provider/manager part.

Part of the muddle I got into was that I was planning to build everything directly on asyncio, but then there's a nice ZMQ 'stream' interface built on tornado, and then I started reading about trio, and...

Updated start_kernel and restart_kernel methods to coroutines. Converted various `time.sleep()` to `yield gen.sleep()` which enables tornado to "slice-out" to a different request. As a result, converting `confirm_remote_startup()` and `handle_timeout()` to coroutines was key. Fixes jupyter-server#86

esevan · 2019-02-20T20:46:49Z

@kevin-bates Would you mind if I ask nb2kg image for this patch?

I'll appreciate it if any guide is given as well for upgrading jupyter notebook and client in nb2kg Dockerfile.

kevin-bates · 2019-02-20T21:20:12Z

@esevan - Although I performed validation of the changes via the REST API, an update to the nb2kg image shouldn't be necessary since the actual changes to enable concurrency really take place in EG via #580. The notebook and jupyter_client changes essentially facilitate the changes in EG. If you're finding that not to be the case, please let me know along with how you go about starting kernels concurrently via the NB2KG image and I'd be happy to take a look.

esevan · 2019-02-20T22:20:47Z

@kevin-bates I'm using this image as my notebook environment. https://github.com/jupyter/enterprise_gateway/blob/master/etc/docker/nb2kg/Dockerfile

I expected this included notebook and client, that It should be changed to facilitate #580.

Updated start_kernel and restart_kernel methods to coroutines. Converted various `time.sleep()` to `yield gen.sleep()` which enables tornado to "slice-out" to a different request. As a result, converting `confirm_remote_startup()` and `handle_timeout()` to coroutines was key. Fixes jupyter-server#86

kevin-bates · 2019-02-20T23:25:04Z

That's correct that the image uses notebook and jupyter_client, but the actual kernel startup and management gets proxied (via NB2KG) to the Enterprise Gateway server. That's where the practical portion of the concurrency changes reside. The NB and JC changes are merely to turn some existing methods into coroutines, but there weren't any additional changes made that trigger significant concurrency. The primary change was to replace the time.sleep() calls with yield gen.sleep() within confirm_remote_startup() and handle_timeout(), but to do that various calling methods (that include superclass methods in NB/JC) required they be decorated with @gen.coroutine.

I've attached the 3 dev-built whl files for EG, NB and JC. You must pip install EG before NB since the NB version violates EG's constraints.
async-startup-wheels.zip

If you only need the applicable EG image, I've pushed one that can be pulled using: docker pull kbates/enterprise-gateway:async

I hope that helps.

esevan · 2019-02-20T23:37:42Z

@kevin-bates So you just modified EG process embedding JC and NB. I thought client side was also changed, and now I got it. Thanks a lot!

esevan · 2019-02-24T07:01:26Z

I've studied kernel startup source codes working in my environment (k8s) on my free time.
Here's just a graffiti as a result of it. Just wanna share this for anyone studying enterprise_gateway in source code level like me. Now I fully understood why I don't need nb2kg image.

kevin-bates · 2019-02-24T19:23:36Z

@esevan - this is excellent!

Updated start_kernel and restart_kernel methods to coroutines. Converted various `time.sleep()` to `yield gen.sleep()` which enables tornado to "slice-out" to a different request. As a result, converting `confirm_remote_startup()` and `handle_timeout()` to coroutines was key. Fixes jupyter-server#86

LK-Tmac1 added this to the Backlog milestone Jul 21, 2017

kevin-bates added runtime hadoop yarn labels Jul 26, 2017

kevin-bates modified the milestones: Sprint 8, Backlog Aug 21, 2017

kevin-bates self-assigned this Aug 21, 2017

kevin-bates modified the milestones: Backlog, Sprint 8 Aug 28, 2017

kevin-bates removed their assignment Aug 29, 2017

kevin-bates added the help wanted label Aug 29, 2017

lresende self-assigned this Aug 29, 2017

lresende modified the milestones: Sprint 9, Backlog Aug 29, 2017

lresende modified the milestones: Sprint 10, Sprint 9 Sep 11, 2017

lresende modified the milestones: Sprint 10, Sprint 11 Sep 21, 2017

lresende modified the milestones: Sprint 11, Backlog Oct 7, 2017

lresende removed their assignment Oct 14, 2017

lresende self-assigned this Nov 6, 2017

lresende removed this from the Backlog milestone Nov 6, 2017

kevin-bates added this to the v0.8 Backlog milestone Nov 7, 2017

kevin-bates removed this from the v0.8 Backlog milestone Mar 26, 2018

kevin-bates changed the title ~~Single-threaded Kernel Startup~~ Remove blocking Kernel Startup Apr 3, 2018

kevin-bates added this to the v1.0 milestone May 11, 2018

kevin-bates modified the milestones: v1.0, v2.0 Jun 1, 2018

kevin-bates removed the hadoop yarn label Jul 24, 2018

amangarg96 mentioned this issue Dec 5, 2018

Exception writing to websocket #516

Closed

xhumanoid mentioned this issue Jan 2, 2019

handle communication with kernel launchers external to k8s #532

Closed

kevin-bates mentioned this issue Jan 13, 2019

Handling concurrent requests on Kubernetes cluster #544

Closed

kevin-bates mentioned this issue Feb 16, 2019

[WIP] Changes necessary to get async kernel startup working jupyter/jupyter_client#425

Closed

kevin-bates assigned kevin-bates and unassigned lresende Feb 19, 2019

kevin-bates added dependency and removed help wanted labels Feb 19, 2019

kevin-bates mentioned this issue Feb 19, 2019

[JUPYTER DEPENDENCY][WIP] Add support for async kernel management #580

Closed

amangarg96 mentioned this issue Nov 28, 2019

Configuring the request queue size in JEG #757

Closed

kevin-bates mentioned this issue Feb 26, 2020

Kernel creation blocked when other Pod is pending in Kubernetes #784

Closed

kevin-bates mentioned this issue Mar 20, 2020

Add support for async kernel management #794

Merged

kevin-bates closed this as completed in #794 Jul 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove blocking Kernel Startup #86

Remove blocking Kernel Startup #86

aazhou1 commented Jul 21, 2017

kevin-bates commented Jul 26, 2017

kevin-bates commented Aug 28, 2017 •

edited

Loading

Carreau commented Nov 12, 2018

kevin-bates commented Nov 12, 2018

takluyver commented Nov 13, 2018

esevan commented Feb 20, 2019

kevin-bates commented Feb 20, 2019

esevan commented Feb 20, 2019

kevin-bates commented Feb 20, 2019

esevan commented Feb 20, 2019

esevan commented Feb 24, 2019

kevin-bates commented Feb 24, 2019

Remove blocking Kernel Startup #86

Remove blocking Kernel Startup #86

Comments

aazhou1 commented Jul 21, 2017

kevin-bates commented Jul 26, 2017

kevin-bates commented Aug 28, 2017 • edited Loading

Carreau commented Nov 12, 2018

kevin-bates commented Nov 12, 2018

takluyver commented Nov 13, 2018

esevan commented Feb 20, 2019

kevin-bates commented Feb 20, 2019

esevan commented Feb 20, 2019

kevin-bates commented Feb 20, 2019

esevan commented Feb 20, 2019

esevan commented Feb 24, 2019

kevin-bates commented Feb 24, 2019

kevin-bates commented Aug 28, 2017 •

edited

Loading