Faster Async Client / Kernel Manager #266

JMurph2015 · 2017-06-23T06:50:07Z

Hi!
This issue is being dropped here after discussion over on this thread:
jupyter-server/kernel_gateway#255

Intro

Basically the client interface to kernels is demonstrably adding a non-trivial amount of latency to every call made against a Jupyter notebook. Now one might ask "who cares about 8ms of latency?" and I would say anyone who would theoretically like to hit the Jupyter backend more than ~100 times a second. Reducing the latency to ~1ms should be achievable with the minimal overhead associated with ZMQ and the really minimal amount of post processing actually happening to any particular message.

What I've Seen So Far

My client code takes ~0.125ms on average to dispatch an execution call to the kernel gateway.
The code I'm sending to be executed is literally 'x=10+25**2'. When I run it 1000 times locally, error in the timing mechanism is still a similar order of magnitude aka estimating 10^-5 ms per run.
The code execution loop (time for the client to respond with results) is measured to take approximately 8ms overall.

This all means that I have ~7 ms (give or take 0.4ms for latency due to tcp and Websockets) that is unaccounted for. It would take a reasonably large amount of code to cause 7ms of delay (for example a simple prime sieve can find several thousand prime numbers in that sort of time).

Actual Question

If anyone has a good idea of what's taking 58x the time it takes me to dispatch calls to receive messages back, that would be awesome (it would be even more awesome if you could point me in the direction of changing that)

takluyver · 2017-06-23T11:07:27Z

Besides the two network hops (websockets, zmq), we deserialise, reserialise, deserialise the message, and do a fair bit of manipulation in Python and Javascript. So I'm not sure whether there's one big cause of the slowness you see or dozens of small ones.

The code I'm sending to be executed is literally 'x=10+25**2'. When I run it 1000 times locally, error in the timing mechanism is still a similar order of magnitude aka estimating 10^-5 ms per run.

I think that's deceptive. Python doesn't do much optimisation, but one thing it does do is constant folding, so the bytecode for your code is equivalent to x=635.

# This roughly matches what you see
 12│ %%timeit
   │ x=10+25**2
   │ 
12.8 ns ± 0.0428 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

# If we prevent it from doing constant folding, it's an order of magnitude slower
 13│ a = 25

 14│ %%timeit
   │ x=10+a**2
   │ 
220 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# Dynamic execution loses us another 2 orders of magnitude
 16│ %%timeit
   │ exec('x=10+a**2')
   │ 
   │ 
11.6 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# And IPython's machinery is another order of magnitude over that.
# This still doesn't include the serialisation and TCP steps.
 18│ %%timeit
   │ get_ipython().run_cell('x=10+a**2')
339 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Carreau · 2017-06-23T14:30:23Z

I guess we could use an Echo Kernel to actually do performance measurement on jupyter_client, then you start the kernel with python -m cProfile [-o output_file] echokernel.py and dig into the output file to see what's the bottle neck is.

We can also look at https://github.com/QuantStack/xeus and make a Python kernel using the Python C API using this, and it should be faster. Though we should make sure that the bottleneck is still actually jupyter_client and not IPython itself.

JMurph2015 · 2017-06-25T01:03:47Z

So one of the sentences of my original question was somewhat imprecise in what I was saying, it was taking 8ms to get results back from the client, not from the kernel directly as I had mentioned, the issue is now updated to reflect that.

Secondarily, while I get that performance is probably a death by a thousand cuts, question of whether the code is taking ~0 vs ~0.4 ms still is somewhat dwarfed by the 6.5 ms still unaccounted for.

Presently, I'm in the process of building some more direct interfacing with the kernels to see what happens when I manage the zmq traffic myself. To this end, is there any good class that I can use that just manages the startup and shutdown of the kernels, but isn't so tightly integrated with the client functionality? Ideally I would like something that just started a kernel and returned to me the url and port to point the zmq sockets at.

JMurph2015 · 2017-06-26T17:40:28Z

Update:
I have gotten it down to 5.2 ms per hit over ipc, using an optimized client (that's about halfway done). Some of this may boil down to just overhead in string processing etc. and/or the IPython kernel itself, but the jury is still out on that, and I'll be digging around to see if I can do any better.

Edit: More results
Running against an echo_kernel (https://github.com/jupyter/echo_kernel) got the runtime down to 4.37ms, so there's about 0.9ms happening in the IPython kernel, which is acceptable I guess. This would mean that 1ms is unachievable here, but we could certainly shoot for 2ms (which is still a respectable 500 Hz).

JMurph2015 · 2017-06-26T22:04:20Z

More results:
I ran timers at various points in the client process to see what was going on. The best metric by far was to log the time that I sent the message, then when processing 'execute_reply's calculating the time-of-flight for the kernel processing.

Here are the results in 50 run averages running against the echo kernel mentioned above:

Total Runtime	Serialization	Deserialization	Time-of-flight
5.0646391 ms	0.2116923 ms	0.3838787 ms	4.2513698 ms

There were some other values I recorded in order to get a more complete picture, but they were implementation specific and thus wouldn't be of much use to you guys. I guess this is mostly an IPython problem / IPython kernel wrapper problem?

PS You might be wondering why this runtime is longer than the previously quoted value, it's just that I started waiting on the 'idle' status from the kernel so that I could collect the stdout and stderr streams. The echo kernel didn't get magically slower.

Carreau · 2017-06-29T12:10:58Z

cc @SylvainCorlay – using https://github.com/QuantStack/xeus he was apparently able to decrease some overhead of the Python implementation of the protocol in some case from 1s to negligible.

JMurph2015 · 2017-06-30T07:49:31Z

@Carreau @takluyver
I also did some testing on the blocking client in the current repo. This lead me to find a bug in my aync test script that left quite a bit of performance on the table (it was issuing commands only after the last one had finished despite using my async client class). So now I have updated numbers on performance across several kernels and the two clients. Times here are measured in milliseconds. These were measured by running a thousand test calls against an appropriately initialized kernel. Thus the warmup run for Python 3 looked like
from random import random\nx = random()\nprint(x)
and the actual test run sent instructions of
x = random()\nprint(x)
This was to avoid the constant folding issue from before since I don't know of any language having a way to cache results of an RNG usefully.

Kernel	Async	Blocking	Percent Improvement
octave	7.429	8.414	12%
python3	4.121	5.072	19%
echo	3.140	4.310	27%
julia-0.6	2.226	2.189	-2%

Please note that the Julia numbers are using a patch that I have in place, but the PR for which hasn't come through yet.

So that brings me to my final question here. Would you guys be interested in an asyncio based client? I would have to do a little bit of asking, but I could probably make it happen. Of course, if it isn't really where this repo/project is going right now, that's fine too.

Update: Methodology comments

Thanks!

rgbkrk · 2017-09-28T00:44:39Z

I'd be interested in an asyncio based client.

SylvainCorlay · 2017-09-28T06:39:02Z

If anyone has a good idea of what's taking 58x the time it takes me to dispatch calls to receive messages back, that would be awesome (it would be even more awesome if you could point me in the direction of changing that)

@JMurph2015 is websocket compression enabled in your configuration?

See: jupyter/notebook#2490

JMurph2015 · 2018-01-12T16:38:20Z

@SylvainCorlay Sorry for being out of the loop on this project for so long, my fall was really busy! So I will try to whip up another PoC as most of my materials were with my summer internship, but I think I could get the bugs ironed out w.r.t. an asyncio client.

JMurph2015 · 2018-01-12T16:39:07Z

And I'm not sure if it was, I'll have to get a new test setup and rerun the numbers.

maartenbreddels mentioned this issue Aug 30, 2019

feat: add client class that uses async/await to integrate with asyncio #471

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster Async Client / Kernel Manager #266

Faster Async Client / Kernel Manager #266

JMurph2015 commented Jun 23, 2017 •

edited

Loading

takluyver commented Jun 23, 2017

Carreau commented Jun 23, 2017

JMurph2015 commented Jun 25, 2017 •

edited

Loading

JMurph2015 commented Jun 26, 2017 •

edited

Loading

JMurph2015 commented Jun 26, 2017 •

edited

Loading

Carreau commented Jun 29, 2017

JMurph2015 commented Jun 30, 2017 •

edited

Loading

rgbkrk commented Sep 28, 2017

SylvainCorlay commented Sep 28, 2017 •

edited

Loading

JMurph2015 commented Jan 12, 2018

JMurph2015 commented Jan 12, 2018

Faster Async Client / Kernel Manager #266

Faster Async Client / Kernel Manager #266

Comments

JMurph2015 commented Jun 23, 2017 • edited Loading

Intro

What I've Seen So Far

Actual Question

takluyver commented Jun 23, 2017

Carreau commented Jun 23, 2017

JMurph2015 commented Jun 25, 2017 • edited Loading

JMurph2015 commented Jun 26, 2017 • edited Loading

JMurph2015 commented Jun 26, 2017 • edited Loading

Carreau commented Jun 29, 2017

JMurph2015 commented Jun 30, 2017 • edited Loading

rgbkrk commented Sep 28, 2017

SylvainCorlay commented Sep 28, 2017 • edited Loading

JMurph2015 commented Jan 12, 2018

JMurph2015 commented Jan 12, 2018

JMurph2015 commented Jun 23, 2017 •

edited

Loading

JMurph2015 commented Jun 25, 2017 •

edited

Loading

JMurph2015 commented Jun 26, 2017 •

edited

Loading

JMurph2015 commented Jun 26, 2017 •

edited

Loading

JMurph2015 commented Jun 30, 2017 •

edited

Loading

SylvainCorlay commented Sep 28, 2017 •

edited

Loading