
Faster Async Client / Kernel Manager #266

Open · JMurph2015 opened this issue Jun 23, 2017 · 11 comments

JMurph2015 commented Jun 23, 2017

Hi!
This issue is being dropped here after discussion over on this thread:
jupyter-server/kernel_gateway#255

Intro

Basically, the client interface to kernels demonstrably adds a non-trivial amount of latency to every call made against a Jupyter notebook. One might ask "who cares about 8 ms of latency?", and my answer is: anyone who would like to hit the Jupyter backend more than ~100 times a second. Given the minimal overhead of ZMQ itself and the small amount of post-processing actually applied to each message, reducing the latency to ~1 ms should be achievable.

What I've Seen So Far

  • My client code takes ~0.125ms on average to dispatch an execution call to the kernel gateway.
  • The code I'm sending to be executed is literally 'x=10+25**2'. When I run it 1000 times locally, the per-run time is around 10^-5 ms, which is the same order of magnitude as the error in the timing mechanism itself.
  • The code execution loop (time for the client to respond with results) is measured to take approximately 8ms overall.

This all means that I have ~7 ms (give or take 0.4 ms of latency due to TCP and WebSockets) that is unaccounted for. It would take a reasonably large amount of code to cause 7 ms of delay (for example, a simple prime sieve can find several thousand prime numbers in that sort of time).

Actual Question

If anyone has a good idea of what's taking 58x the time it takes me to dispatch calls to receive messages back, that would be awesome (it would be even more awesome if you could point me in the direction of changing that)

@takluyver (Member)

Besides the two network hops (websockets, zmq), we deserialise, reserialise, deserialise the message, and do a fair bit of manipulation in Python and Javascript. So I'm not sure whether there's one big cause of the slowness you see or dozens of small ones.

> The code I'm sending to be executed is literally 'x=10+25**2'. When I run it 1000 times locally, the per-run time is around 10^-5 ms, which is the same order of magnitude as the error in the timing mechanism itself.

I think that's deceptive. Python doesn't do much optimisation, but one thing it does do is constant folding, so the bytecode for your code is equivalent to x=635.
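For what it's worth, a quick dis check shows the folding directly (illustrative only; the exact bytecode varies between CPython versions):

```python
import dis

# The compiled bytecode loads the constant 635 directly, so the arithmetic
# in 'x=10+25**2' never actually runs at execution time.
dis.dis(compile("x=10+25**2", "<test>", "exec"))
```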

```
# This roughly matches what you see
In [12]: %%timeit
    ...: x=10+25**2
12.8 ns ± 0.0428 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

# If we prevent it from doing constant folding, it's an order of magnitude slower
In [13]: a = 25

In [14]: %%timeit
    ...: x=10+a**2
220 ns ± 1.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# Dynamic execution loses us another 2 orders of magnitude
In [16]: %%timeit
    ...: exec('x=10+a**2')
11.6 µs ± 44.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# And IPython's machinery is another order of magnitude over that.
# This still doesn't include the serialisation and TCP steps.
In [18]: %%timeit
    ...: get_ipython().run_cell('x=10+a**2')
339 µs ± 2.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Carreau (Member) commented Jun 23, 2017

I guess we could use an echo kernel to do performance measurements on jupyter_client itself: start the kernel with python -m cProfile [-o output_file] echokernel.py and dig into the output file to see where the bottleneck is.
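Something like this could then be used to read the profile back (a minimal sketch, assuming the profile was written to a file called kernel.prof):

```python
# Minimal sketch: load the cProfile output (written via `-o kernel.prof`) and
# print the functions with the largest cumulative time.
import pstats

stats = pstats.Stats("kernel.prof")
stats.strip_dirs().sort_stats("cumulative").print_stats(25)
```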

We can also look at https://github.com/QuantStack/xeus and use it to make a Python kernel via the Python C API, which should be faster. Though we should make sure the bottleneck actually is jupyter_client and not IPython itself.

JMurph2015 (Author) commented Jun 25, 2017

One of the sentences in my original question was imprecise: it was taking 8 ms to get results back from the client, not from the kernel directly as I had stated. The issue is now updated to reflect that.

Secondly, while I get that the performance loss is probably death by a thousand cuts, the question of whether the executed code takes ~0 vs ~0.4 ms is still dwarfed by the ~6.5 ms that remains unaccounted for.

Presently, I'm building a more direct interface to the kernels to see what happens when I manage the ZMQ traffic myself. To that end, is there a good class that just manages kernel startup and shutdown without being so tightly integrated with the client functionality? Ideally I'd like something that just starts a kernel and hands me back the address and ports to point my ZMQ sockets at.
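For reference, the kind of thing I'm picturing is roughly this (an untested sketch against jupyter_client's KernelManager as I understand it; the exact method names may differ between versions):

```python
# Untested sketch: use jupyter_client's KernelManager only for kernel lifecycle,
# and point raw ZMQ sockets at the connection info myself.
import zmq
from jupyter_client import KernelManager

km = KernelManager(kernel_name="python3")
km.start_kernel()

info = km.get_connection_info()   # transport, ip, shell_port, iopub_port, key, ...
shell_addr = "{transport}://{ip}:{shell_port}".format(**info)

ctx = zmq.Context.instance()
shell = ctx.socket(zmq.DEALER)    # client end of the kernel's shell (ROUTER) socket
shell.connect(shell_addr)

# ... speak the wire protocol directly over `shell` here ...

shell.close()
km.shutdown_kernel()
```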

JMurph2015 (Author) commented Jun 26, 2017

Update:
I have gotten it down to 5.2 ms per hit over IPC, using an optimized client (that's about halfway done). Some of this may boil down to overhead in string processing etc. and/or the IPython kernel itself, but the jury is still out on that, and I'll keep digging to see if I can do any better.

Edit: More results
Running against an echo kernel (https://github.com/jupyter/echo_kernel) got the runtime down to 4.37 ms, so roughly 0.9 ms is spent in the IPython kernel itself, which is acceptable I guess. That would mean 1 ms is unachievable here, but we could certainly shoot for 2 ms (which is still a respectable 500 Hz).

JMurph2015 (Author) commented Jun 26, 2017

More results:
I ran timers at various points in the client process to see what was going on. The most useful metric by far was to log the time at which each message was sent and then, when processing the corresponding 'execute_reply', calculate the time-of-flight through the kernel.
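Roughly, the measurement works like this (an illustrative sketch, not my actual harness; the function names are made up):

```python
# Illustrative sketch: record when each execute_request is sent, then compute
# time-of-flight when the matching execute_reply (matched via parent_header.msg_id)
# is processed.
import time

sent_at = {}

def record_send(msg_id):
    sent_at[msg_id] = time.perf_counter()

def record_execute_reply(reply):
    parent_id = reply["parent_header"]["msg_id"]
    return (time.perf_counter() - sent_at.pop(parent_id)) * 1000.0  # ms
```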

Here are the results, averaged over 50 runs against the echo kernel mentioned above:

| Total runtime | Serialization | Deserialization | Time-of-flight |
| ------------- | ------------- | --------------- | -------------- |
| 5.0646391 ms  | 0.2116923 ms  | 0.3838787 ms    | 4.2513698 ms   |

There were some other values I recorded in order to get a more complete picture, but they were implementation specific and thus wouldn't be of much use to you guys. I guess this is mostly an IPython problem / IPython kernel wrapper problem?

PS: You might be wondering why this runtime is longer than the previously quoted value; it's just that I started waiting on the 'idle' status from the kernel so that I could collect the stdout and stderr streams. The echo kernel didn't magically get slower.

Carreau (Member) commented Jun 29, 2017

cc @SylvainCorlay – using https://github.com/QuantStack/xeus he was apparently able to reduce the overhead of the Python implementation of the protocol, in some cases from 1 s to negligible.

JMurph2015 (Author) commented Jun 30, 2017

@Carreau @takluyver
I also did some testing on the blocking client in the current repo. This led me to find a bug in my async test script that left quite a bit of performance on the table (it was issuing commands only after the previous one had finished, despite using my async client class). So now I have updated performance numbers across several kernels and the two clients. Times are in milliseconds and were measured by running a thousand test calls against an appropriately initialized kernel. The warmup run for Python 3 looked like
from random import random\nx = random()\nprint(x)
and the actual test run sent
x = random()\nprint(x)
This avoids the constant-folding issue from before, since I don't know of any language that can usefully cache the results of an RNG.

| Kernel    | Async (ms) | Blocking (ms) | Improvement |
| --------- | ---------- | ------------- | ----------- |
| octave    | 7.429      | 8.414         | 12%         |
| python3   | 4.121      | 5.072         | 19%         |
| echo      | 3.140      | 4.310         | 27%         |
| julia-0.6 | 2.226      | 2.189         | -2%         |

Please note that the Julia numbers rely on a patch I have in place locally; the PR for it hasn't gone through yet.

So that brings me to my final question here. Would you guys be interested in an asyncio based client? I would have to do a little bit of asking, but I could probably make it happen. Of course, if it isn't really where this repo/project is going right now, that's fine too.

Update: Methodology comments

Thanks!

rgbkrk (Member) commented Sep 28, 2017

I'd be interested in an asyncio based client.

SylvainCorlay (Member) commented Sep 28, 2017

> If anyone has a good idea of what's taking 58x the time it takes me to dispatch calls to receive messages back, that would be awesome (it would be even more awesome if you could point me in the direction of changing that)

@JMurph2015 is websocket compression enabled in your configuration?

See: jupyter/notebook#2490

@JMurph2015 (Author)

@SylvainCorlay Sorry for being out of the loop on this project for so long; my fall was really busy! I will try to whip up another PoC, since most of my materials stayed with my summer internship, but I think I can get the bugs ironed out w.r.t. an asyncio client.

@JMurph2015 (Author)

And I'm not sure if it was; I'll have to get a new test setup and rerun the numbers.
