Allow to transfer binary data without serialization (CC #67) #103
Conversation
Thanks for working on this. I agree that sending binary data/buffer data is going to be very slow at present. The data has to be UTF-8 encoded, JSON serialized, UTF-8 decoded, JSON parsed, then converted into the desired data type. This is especially bad because JSON.stringify on a Buffer writes all the bytes as a comma-delimited ASCII array. Ideally, length-prefixed binary data would be written directly over the stdio pipes without those transformations; however, that's a bit challenging, as we use Python's .readline() stream-reading API, which relies on newlines to delimit data packets (and the binary data can itself contain newlines or null terminators). This will require some thought on the best way to optimize the communication, maybe even a refactor, so I'll have to think about it some more. One easy way I can think of to improve the performance at least slightly, by serializing less data, is to send Base64-encoded binary Buffer data instead of a JSON array. It would be great to have a script to benchmark the existing implementation vs base64.
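For context, here is a small standalone sketch of that comparison. It measures a plain JSON integer array against base64 on an arbitrary random payload, outside the bridge, so it only approximates what the existing implementation does:

```python
# Rough comparison of a JSON integer-array encoding of binary data (similar in spirit
# to JSON.stringify on a Buffer's byte array) versus base64. The 10 MB random payload
# is an arbitrary choice for illustration.
import base64
import json
import os
import time

payload = os.urandom(10 * 1024 * 1024)

start = time.time()
as_json = json.dumps(list(payload))   # comma-delimited ASCII integers
json.loads(as_json)
json_time = time.time() - start

start = time.time()
as_b64 = base64.b64encode(payload).decode("ascii")
base64.b64decode(as_b64)
b64_time = time.time() - start

print(f"JSON array: {len(as_json):>12} bytes, {json_time:.3f}s round trip")
print(f"base64:     {len(as_b64):>12} bytes, {b64_time:.3f}s round trip")
```

base64 stays within about 4/3 of the raw size, while the integer array is several times larger and much slower to parse.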
Thanks for the response!
Yes, I've been wondering about that, too. The proposed patch basically handles this by immediately parsing the server's responses and giving the parser access to the stream, so it can fetch possible data after the first `\n`.
Yes, I expect base64 would be better than a UTF-8 encoded, comma-delimited JSON array, but it would still imply an encode/decode layer and some size increase (base64 turns every 3 bytes into 4 characters, roughly 33% overhead). This patch should result in a larger performance gain for blob transfer than base64 alone.
Sounds good, can you fix the conflict by merging master? Adding a test to https://github.com/extremeheat/JSPyBridge/blob/master/test/javascript/test_general.py to make sure this works would be great, then we can merge this PR. Similar functionality would probably make sense for Python -> JS, but that can be explored later.
Yes, I plan to do the rebase/test/docs (and maybe also add the pdfjs example?), but it might take some time until I can resume working on this. You're right that something similar would be needed for Python -> JS, but I haven't looked at that side of the communication yet, since it was not my immediate use case. But IIRC pdfjs also supports opening PDFs from a byte string, so we might get to that as well sooner or later.
(force-pushed from 10ceb3f to e98823d)
@extremeheat I believe this may be ready for review/merge. Could you please check that the Jupyter notebook code passage still works after the change? I'm not familiar with this tool.
(force-pushed from 753ef4a to f753c9b)
(force-pushed from b06913e to caef781)
It sometimes works, sometimes not. See below for examples.

Working:
https://cinelerra-gg.org/download/CinelerraGG_Manual.pdf
https://pikepdf.readthedocs.io/_/downloads/en/latest/pdf/

Failing:
https://www.gemeinde-grindelwald.ch/wp-content/uploads/2020/06/Broschuere_Landwirtschaft_und_Tourismus_Franzoesisch.pdf -> getHexString reports invalid characters. The file renders when downloaded, though.

Further, the document below produces an out-of-memory error, but it also does so when rendering locally:
https://bro.jrtag.ch/Gri/so_guide/de_en/files/downloads/Infoguide_Sommer_23-WEB.pdf
Now a macOS test is failing, but I don't understand the error. Seems to be something with a mismatch.
Sorry for the delay here, I'll be testing this locally as soon as possible and merge if no problems arise. Thanks for the work on this :)
Thanks, I've just pushed some final style touches. Hope the README changes are OK. I tried to rephrase some things to make them easier to understand, but I'm somewhat afraid I might have gotten something wrong due to my limited knowledge of the JS/IPC internals.
When merging, would you please also squash the commits? Thanks!
Thanks for the work on this!
Thanks for merging!
Would you mind triggering a new release for this?
/makerelease
Hmm, while this is certainly faster than before, it's still in the nature of data piping to be rather slow. For a sequence of discardable data blocks of constant or similar size, I imagine shared memory would be a lot better. I don't know how expensive it is to allocate shared memory, or whether there are specific limits. Is anyone aware of a cross-platform, maintained shared-memory library in JS that could be connected with Python's?
Reverts change #103 as sys.stdout.fileno doesn't exist on Colab notebooks; also fix color outputs in notebooks
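As an aside, here is a minimal sketch (an assumption about the failure mode, not the actual fix in the revert) of the kind of guard that avoids this on Colab/Jupyter, where sys.stdout is replaced by a stream object without a usable file descriptor; the helper name is illustrative:

```python
import sys

def stdout_has_usable_fd() -> bool:
    # Hypothetical helper: only take an fd-based fast path when stdout exposes a real
    # file descriptor; notebook front-ends typically raise here instead.
    try:
        sys.stdout.fileno()
    except (AttributeError, OSError, ValueError):  # io.UnsupportedOperation subclasses OSError/ValueError
        return False
    return True
```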
I think I finally found a way to use shared memory, via shm-typed-array on the JS side and posix_ipc plus mmap on the Python side (see the script below).
A missing piece: I don't know how to create a canvas backed by an external buffer, so we currently have to copy into the shared memory segment. It seems the canvas library might not provide an API for this. There are also some cleanup issues to be figured out (apparently, the shared memory is not destroyed on keyboard interrupt, despite the try/finally wrapper).

```python
import time
import mmap
import argparse
from pathlib import Path
# third-party
import PIL.Image
import javascript
import posix_ipc
# NOTE canvas must be the build pdfjs is linked against, otherwise it'll fail with type error
pdfjs = javascript.require("pdfjs-dist")
libcanvas = javascript.require("canvas")
libshm = javascript.require("shm-typed-array")
def render_pdf(input, outdir, scale):
    pdf = pdfjs.getDocument(input).promise
    n_pages = pdf.numPages
    n_digits = len(str(n_pages))
    # first pass: collect page sizes to determine the largest buffer we will need
    sizes = []
    for i in range(n_pages):
        page = pdf.getPage(i + 1)
        viewport = page.getViewport({"scale": scale})
        w, h = int(viewport.width), int(viewport.height)
        sizes.append((w, h))
    max_alloc = max(w * h for w, h in sizes) * 4  # 4 bytes per BGRX pixel
    print(f"Shared memory size {max_alloc}")
    memkey = "/pypdfjs_render_shm"
    # the JS side allocates the shared segment; create() returns None if the key is taken
    js_shm = libshm.create(max_alloc, "Buffer", memkey)
    assert js_shm is not None, "Shared memory segment of this name already exists"
    try:
        # attach to the same segment from Python and map it for direct reads
        py_shm_handle = posix_ipc.SharedMemory(memkey)
        py_shm = mmap.mmap(py_shm_handle.fd, py_shm_handle.size)
        for i in range(n_pages):
            page = pdf.getPage(i + 1)
            viewport = page.getViewport({"scale": scale})
            w, h = sizes[i]
            canvas = libcanvas.createCanvas(w, h)
            context = canvas.getContext("2d")
            page.render({"canvasContext": context, "viewport": viewport}).promise
            js_buffer = canvas.toBuffer("raw")
            starttm = time.time()
            # copy the rendered BGRX pixels from the JS Buffer into shared memory
            js_buffer.copy(js_shm)
            print(f"Data transfer took {time.time() - starttm}s")
            # reinterpret the shared bytes as an image on the Python side (no pipe transfer)
            pil_image = PIL.Image.frombuffer("RGBX", (w, h), py_shm, "raw", "BGRX", 0, 1)
            py_shm.seek(0)
            pil_image.save(outdir / f"out_{i:0{n_digits}d}.jpg")
    finally:
        libshm.destroy(memkey)
        pdf.destroy()
```
This patch would allow transferring binary data (e.g. Buffer) directly from JS to Python, without JSON serialization, but still using the existing stderr pipe mechanism (see #67).

Example of the significance: rendering 10 PDF pages at ~200 DPI BGRX and transferring them via the existing `bytes(js_buffer.valueOf()["data"])` takes ~35s. The new `js_buffer.blobValueOf()` added here reduces this to ~3s. (Interestingly enough, tempfiles are actually a bit faster than `blobValueOf()`.)

That said, I'm rather uncertain about the implementation. Blobs can't be handled using a simple separator (such as `\n`) but need to be passed with an explicit length, which breaks `stderr_lines`. This patch changes to immediate parsing (rather than parsing deferred to `readAll()`) so we can fetch the remainder after the first `\n` and register a blob as a whole, instead of erroneously splitting it into separate lines. Is this OK? Any alternative ideas? `readAll()` is now effectively a no-op, merely returning the prepared batch of items. I guess it should also be possible to feed the individual items into the queue directly?
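For readers skimming the diff, here is a rough sketch of the length-prefixed framing idea described above (the field names and the "blob" marker are illustrative assumptions, not the patch's actual wire format): ordinary packets stay newline-delimited JSON, while a blob announces its byte length up front so the reader can consume it whole even if the payload contains `\n` or NUL bytes.

```python
import io
import json

def read_packet(stream: io.BufferedReader):
    header = stream.readline()           # normal newline-delimited JSON header
    if not header:
        return None                      # end of stream
    msg = json.loads(header)
    if msg.get("type") == "blob":        # hypothetical marker for binary payloads
        msg["data"] = stream.read(msg["len"])  # explicit length replaces the newline delimiter
    return msg
```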