Improving Network Performance #312
Hi @mlucool, thanks for the information. I started digging into this and found the following bottlenecks. Trying out the notebook you shared, the read takes 23.13 secs on my computer, and this is directly related to
So reading the notebook with nbformat.read and validating it with nbformat.validate seem to be the biggest problems here. Now that this is isolated, I can start looking directly into nbformat to find what can be improved.
A simple script to test this:

```python
import nbformat
import time

TEST_FILE = '50000-errors.ipynb'

def test():
    as_version = 4
    start_time = time.time()
    print("Start:\t0.00")
    with open(TEST_FILE, 'r', encoding='utf-8') as f:
        model = nbformat.read(f, as_version=as_version)
    print("Open:\t" + str(round(time.time() - start_time, 2)))
    nbformat.validate(model)
    print("Valid:\t" + str(round(time.time() - start_time, 2)))

if __name__ == "__main__":
    test()
```

Yields in seconds:
Some more progress: it seems the
So we could probably remove the extra validation in jupyter_server, or add an optional parameter on nbformat.read to skip validation. Any thoughts on why validation is being performed twice, @kevin-bates, @Zsailer? Maybe there is a historical reason? This would already cut the time in half. Validation itself is still taking ~10 seconds, which seems like far too much time. Looking into validation now. Some reference docs:
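To isolate how much of nbformat.read's cost is validation versus JSON parsing, one can use nbformat's low-level reader, which parses without the jsonschema pass, and then convert to v4. This is a sketch; `demo.ipynb` is a small generated stand-in for the 50000-errors.ipynb benchmark file:

```python
import time

import nbformat
from nbformat import convert, reader

# Build a small valid v4 notebook on disk as a stand-in for the
# large benchmark notebook used above.
nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_code_cell("print('hi')"))
with open("demo.ipynb", "w", encoding="utf-8") as f:
    nbformat.write(nb, f)

start = time.time()
with open("demo.ipynb", encoding="utf-8") as f:
    raw = reader.read(f)            # JSON parse only, no schema validation
model = convert(raw, to_version=4)  # upgrade the format, still no validation
print(f"read without validation: {time.time() - start:.3f}s")
```

On a large notebook, the difference between this and a full `nbformat.read` shows how much time the validation pass alone costs.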
Thx @goanpeca for these findings. Not sure if there is a historical reason for this. I guess if we don't find a reason, we should ensure no double validation is applied. If we don't want to depend on
PS: Any change in
Hi @echarles, thanks for the feedback!
Can do :-) Are there any thoughts on actually replacing jsonschema with fastjsonschema? I will run some benchmarks to compare, but it seems like an easy win (if all tests pass, of course 🙃)
Bumping this issue. It would be good to be able to serve static assets via nginx/apache, as we could pre-compress them and use HTTP2 (which tornado does not support). One idea is that jupyter-server could have an opt-in feature that creates a directory of symlinks to static assets. E.g. if I set
The same would also be done for extensions (
After this feature, we can document a minimal nginx config that adds HTTP2/compression and intercepts static resources from tornado. Thoughts?
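The symlink idea could be sketched roughly like this (all names are hypothetical; the real feature would need to discover each extension's static directory from the server):

```python
from pathlib import Path

def link_static_assets(static_dirs, out_dir):
    """Mirror each handler's static directory into one tree that
    nginx/apache can serve directly. `static_dirs` maps a URL prefix
    (e.g. "lab") to the source directory on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, src in static_dirs.items():
        dest = out / name
        if not dest.exists():
            # Symlink rather than copy, so the tree tracks upgrades.
            dest.symlink_to(Path(src).resolve(), target_is_directory=True)
```

A reverse proxy pointed at `out_dir` could then serve every asset without touching the python server.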
I can't really comment on static asset performance, but did want to comment on something @goanpeca posted nearly a year and a half ago (and I apologize for not seeing this, or responding, sooner!):
A potential 50% savings on notebook fetches is significant and something, I believe, we should take a closer look at! The methods in question (read and validate) were written 7 to 8 years ago, so I can't comment on their history. I suspect the reason for the second

Unfortunately, 7 to 8 years is lots of water under the bridge and, as you point out, it seems like the only backward-compatible change would be to add an optional parameter that indicates validation be skipped, knowing the server would circle back to perform validation itself. Had this been 8 years ago, it may have been better to raise

Both

Another thing worth trying would be to set env

@goanpeca - would you be able to attach a copy of your
@goanpeca was working on this for some of our use cases.
This has been great. We have been using NBFORMAT_VALIDATOR for ~1.5 years and have never had an issue with it. I'd recommend it as the default as you suggested, or at least changing the default to use it if it's installed:
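For reference, opting in today looks like this (assuming fastjsonschema is installed in the environment):

```shell
# pip install fastjsonschema
export NBFORMAT_VALIDATOR=fastjsonschema   # nbformat reads this at runtime
# then start the server as usual, e.g. `jupyter lab`
```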
I think this is worthwhile too.
We have made this public in jupyterlab/benchmarks: 50000-errors. FWIW, locally this takes just under 3s before it starts sending the file, even with fastjsonschema. If you use a newer version of lab, lab won't blow up on you due to the work done to limit the number of mime renders per cell.
I can bring this idea up for discussion at an upcoming jupyter-server meeting if that's a better way to move this forward.
This is great info Marc - thank you!
Given the stability you've seen, this seems like an obvious thing to do. We could discuss whether this should be an opt-in (or opt-out) feature - perhaps flipping the default on a major release, for example.
Hmm, I see double validation is also performed when writing a notebook, so we'd probably want to do similar for those methods as well. I'm curious if @MSeal has an opinion on how to address the read/write methods - whether an optional "skip_validation" parameter would be reasonable or some other approach.
That's probably best - thank you.
I played around making
From what I have seen, it's very rare to have an error. If you wanted to push this forward without waiting for horejsek/python-fastjsonschema#72, you could use fastjsonschema and, if that fails, fall back to jsonschema. If not, maybe we can advertise this more and mention the downsides.
Yeah that makes sense, I'll try that logic.
Bringing up a couple of ideas discussed on this request, in addition to the nbformat/fastjsonschema fix -
This has been brought up again during a discussion of compression of static assets (jupyterlab/jupyterlab#13189). I'll try this setup and report back with numbers here.
How can we optimally transfer assets to Jupyter clients (web browsers)?
Hypothesis: HTTP2 (i.e. no head-of-line blocking) and compression would meaningfully improve page load and large notebook load performance.
Experiment: Create an nginx config that adds SSL/HTTP2/compression and use it as a simple reverse proxy in front of a JupyterLab 2.x server. Then use Chrome dev tools to understand changes to performance. In this setup my server and browser are not in the same physical location, but are connected by a high-speed network. I had exactly one location block, so static assets still came via tornado.
Conclusion: Surprisingly, these technologies, when naively put on top of a jupyterlab 2.x server, did not make a meaningful difference. The reverse proxy decreased the size of small assets but increased page load time by ~10%. Large assets clearly shrank by a large amount, 10x-23,000x (the latter being a generated, very compressible test notebook), but the time to compress them on the fly meant there were minimal gains to be had. The ~10MB vendors bundle I had was compressed to 2.5MB but took longer to get to the browser. A 33MB notebook shrank to 1.6kb but still took about 30s either way. I'll note that most of my notebooks are small (<5MB).
I ran a second experiment where I put the notebook directly behind the same nginx server. In this case I was able to download the 33MB notebook in ~100ms!
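A minimal nginx sketch of that setup (paths, ports, and the symlinked asset directory are all assumptions, not tested configuration):

```nginx
server {
    listen 443 ssl http2;
    # ssl_certificate / ssl_certificate_key omitted for brevity
    gzip on;
    gzip_types application/json application/javascript text/css;

    # Serve static assets directly, bypassing tornado entirely.
    location /static/ {
        root /srv/jupyter-assets;   # hypothetical symlink directory
        expires max;                # immutable content: cache aggressively
    }

    # Everything else goes to the Jupyter server.
    location / {
        proxy_pass http://127.0.0.1:8888;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;   # websockets for kernels
        proxy_set_header Connection "upgrade";
    }
}
```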
In my view this experiment points to some large gains that can be had by letting assets skip the python server or thinning out the code path between the two. A few suggestions:
1. (`/static`)
2. `/static`: jupyter should treat it as such and set the right headers (today I see `no-cache` set, for example). Doing 1 should help people automatically do this, but doing this in jupyter_server may be useful for the average case.

Pictures are worth 1000 words:
No optimizations page load:
Nginx page load:
No optimizations large notebook:
Nginx large notebook:
Directly sending the notebook (renamed to `foo.json`):

cc @goanpeca