
Allow to transfer binary data without serialization (CC #67) #103

Merged
merged 13 commits into extremeheat:master from sendblob on Nov 27, 2023

Conversation

mara004
Contributor

@mara004 mara004 commented Oct 15, 2023

This patch would allow transferring binary data (e.g. a Buffer) directly from JS to Python, without JSON serialization, using the existing stderr pipe mechanism (see #67).

Example of the impact:
Rendering 10 PDF pages at ~200 DPI BGRX and transferring them via the existing bytes(js_buffer.valueOf()["data"]) takes ~35s. The new js_buffer.blobValueOf() added here reduces this to ~3s. (Interestingly enough, tempfiles are actually a bit faster than blobValueOf().)
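For illustration, a minimal timing sketch of the two paths (assuming js_buffer is a Buffer proxy obtained through the bridge, and that blobValueOf() returns bytes):

import time

start = time.time()
data = bytes(js_buffer.valueOf()["data"])  # existing JSON-serialized path
print(f"valueOf: {time.time() - start:.2f}s")

start = time.time()
data = js_buffer.blobValueOf()  # raw transfer over the stderr pipe (this patch)
print(f"blobValueOf: {time.time() - start:.2f}s")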


That said, I'm rather uncertain about the implementation. Blobs can't be delimited with a simple separator (such as \n) but need to be passed with an explicit length, which breaks stderr_lines.

This patch switches to immediate parsing (rather than parsing deferred to readAll()) so we can fetch the remainder after the first \n and register a blob as a whole, instead of erroneously splitting it into separate lines. Is this OK? Any alternative ideas?

readAll() is now effectively a no-op that merely returns the prepared batch of items. I guess it should also be possible to feed the individual items into the queue directly?

@mara004 mara004 changed the title [exp] allow to transfer blobs without serialization (CC #67) [Experimental] Allow to transfer blobs without serialization (CC #67) Oct 15, 2023
@mara004 mara004 changed the title [Experimental] Allow to transfer blobs without serialization (CC #67) Allow to transfer blobs without serialization (CC #67) Oct 15, 2023
@mara004 mara004 changed the title Allow to transfer blobs without serialization (CC #67) Allow to transfer binary data without serialization (CC #67) Oct 15, 2023
@extremeheat
Owner

Thanks for working on this. I agree that sending binary/buffer data is going to be very slow at present. The data has to be encoded as UTF-8, JSON serialized, UTF-8 decoded, JSON parsed, and then converted into the desired data type. This is especially bad as JSON.stringify on a Buffer writes all the bytes as a comma-delimited ASCII array.

Ideally, length-prefixed binary data would be written directly over the stdio pipes without those transformations. However, that's a bit challenging as we use Python's .readline() stream reading API, which uses newlines to delimit data packets (and the binary data can itself contain newlines or null terminators). This will require some thought on the best way to optimize the communication, maybe even a refactor, so I'll have to think about it some more.
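For illustration, length-prefixed framing could look roughly like this (a sketch, not the library's actual protocol):

import struct

def write_frame(stream, payload: bytes):
    # 4-byte big-endian length prefix, then the raw payload
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    stream.flush()

def read_frame(stream) -> bytes:
    (length,) = struct.unpack(">I", stream.read(4))
    # read exactly the announced number of bytes; safe even if the
    # payload contains \n or \0
    return stream.read(length)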

One easy way I can think of to improve the performance, at least slightly, by serializing less data, is to send binary Buffer data Base64-encoded instead of as JSON. It would be great to have a script to benchmark the existing implementation vs. base64.
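A rough benchmark sketch of the two encodings (illustrative only):

import base64, json, os, time

data = os.urandom(10 * 1024 * 1024)  # 10 MiB of random bytes

t = time.time()
as_json = json.dumps(list(data))  # comma-delimited decimal array, similar to what JSON.stringify yields for a Buffer's data
print(f"json: {time.time() - t:.2f}s, {len(as_json)} chars")

t = time.time()
as_b64 = base64.b64encode(data).decode("ascii")  # only ~33% larger than the raw data
print(f"base64: {time.time() - t:.2f}s, {len(as_b64)} chars")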

@mara004
Contributor Author

mara004 commented Oct 17, 2023

Thanks for the response!

Ideally, length-prefixed binary data would be written directly over the stdio pipes without those transformations. However, that's a bit challenging as we use Python's .readline() stream reading API, which uses newlines to delimit data packets (and the binary data can itself contain newlines or null terminators). This will require some thought on the best way to optimize the communication, maybe even a refactor, so I'll have to think about it some more.

Yes, I've been wondering about that, too. The proposed patch basically handles this by parsing the server's responses immediately and giving the parser access to the stream, so it can fetch the data following the first \n when a length-annotated blob request comes in (see code passage). This seemed like a quick, minimally invasive fix.
However, I agree it may be a good idea to eventually revisit the overall IPC concept in light of blobs.
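In rough pseudocode, the idea is something like the following (the message keys and the deliver_* helpers are illustrative, not the actual patch):

import json

def on_stderr_line(stream, line: bytes):
    msg = json.loads(line)
    if msg.get("c") == "blob":
        # length-annotated blob: consume the announced number of raw bytes
        # directly from the stream instead of splitting them at newlines
        payload = stream.read(msg["len"])
        deliver_blob(msg["id"], payload)
    else:
        deliver_json(msg)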

One easy way I can think of to improve the performance, at least slightly, by serializing less data, is to send binary Buffer data Base64-encoded instead of as JSON.

Yes, I expect base64 would be better than a UTF-8 encoded, comma-delimited JSON array, but it would still imply an encode/decode layer and some size increase. This patch should yield a bigger performance gain for blob transfer than base64 alone.

@extremeheat
Owner

Sounds good, can you fix the conflict by merging master?

Adding a test to https://github.com/extremeheat/JSPyBridge/blob/master/test/javascript/test_general.py to make sure this works would be great, then we can merge this PR.

Similar functionality would probably make sense for Python -> JS, but that can be explored later.

@mara004
Contributor Author

mara004 commented Oct 24, 2023

Yes, I plan to do the rebase/test/docs (and maybe also add the pdfjs example?), but it might take some time until I can resume working on this.

You're right, something similar would be needed for Py to JS, but I haven't looked at that side of the communication yet, since it wasn't my immediate use case. But IIRC pdfjs also supports opening PDFs from a byte string, so we might get to that sooner or later as well.

@mara004 mara004 force-pushed the sendblob branch 10 times, most recently from 10ceb3f to e98823d on November 2, 2023 15:00
@mara004 mara004 marked this pull request as ready for review November 2, 2023 15:00
@mara004
Contributor Author

mara004 commented Nov 2, 2023

@extremeheat I believe this may be ready for review/merge.

Could you please check that the Jupyter notebook code passage still works after the change? I'm not familiar with this tool.

@mara004

This comment was marked as outdated.

@mara004 mara004 force-pushed the sendblob branch 2 times, most recently from 753ef4a to f753c9b on November 4, 2023 13:33
@mara004 mara004 force-pushed the sendblob branch 2 times, most recently from b06913e to caef781 on November 4, 2023 14:02
Sometimes it works, sometimes it doesn't. See below for examples:

Working:
https://cinelerra-gg.org/download/CinelerraGG_Manual.pdf
https://pikepdf.readthedocs.io/_/downloads/en/latest/pdf/

Failing:
https://www.gemeinde-grindelwald.ch/wp-content/uploads/2020/06/Broschuere_Landwirtschaft_und_Tourismus_Franzoesisch.pdf
-> getHexString reports invalid characters. The file renders fine when downloaded, though.

Further, the document below produces an out-of-memory error, which also occurs when rendering locally:
https://bro.jrtag.ch/Gri/so_guide/de_en/files/downloads/Infoguide_Sommer_23-WEB.pdf
@mara004
Contributor Author

mara004 commented Nov 5, 2023

Now a macOS test is failing, but I don't understand the error. It seems to be something with a mismatched eval_js() expectation. Is it really caused by this change? Could you run the test workflow on master to see if it fails there, too?

@extremeheat
Owner

Sorry for the delay here, I'll test this locally as soon as possible and merge if no problems arise. Thanks for the work on this :)

@mara004
Contributor Author

mara004 commented Nov 18, 2023

Thanks! In the meantime, I've pushed some final style touches.

I hope the Readme changes are OK. I tried to rephrase some things to make them easier to understand, but I'm somewhat afraid I might have gotten something wrong due to my limited knowledge of the JS/IPC subject.
There are still some phrases in that section I don't entirely understand...

@mara004
Contributor Author

mara004 commented Nov 18, 2023

When merging, would you please also squash the commits? Thanks!

- match test name letter case
- add print and 2 more test values
- use list comp
@extremeheat
Owner

Thanks for the work on this!

@extremeheat extremeheat merged commit 36a0c74 into extremeheat:master Nov 27, 2023
6 checks passed
@mara004
Contributor Author

mara004 commented Nov 27, 2023

Thanks for merging!

@mara004
Contributor Author

mara004 commented Nov 30, 2023

Would you mind triggering a new release for this?

@extremeheat
Owner

/makerelease

@import-bot import-bot mentioned this pull request Nov 30, 2023
@mara004
Contributor Author

mara004 commented Dec 9, 2023

Hmm, while this is certainly faster than before, piping data is still inherently rather slow.

For a sequence of discardable data items of static or similar size, I imagine shared memory would be a lot better, e.g. for use cases like streaming video from JS to Python. The same applies to PDFs: most have fairly homogeneous page sizes, so we could just pre-allocate a shared buffer the size of the largest page.

I don't know how expensive it is to allocate shared memory, or whether there are specific limits.
I'd be curious how a procedure of allocating new shared memory, copying in the data, and taking over the segment with Python would compare in performance to the current blobValueOf().

Is anyone aware of a cross-platform, maintained shared memory library in JS that could be connected with Python's multiprocessing.shared_memory module?
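For reference, the Python side alone is simple with the stdlib (a minimal sketch; the open question is a matching, maintained JS counterpart):

from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=4 * 1024 * 1024, name="frame_buf")
try:
    shm.buf[:5] = b"hello"  # shm.buf is a writable memoryview over the segment
finally:
    shm.close()
    shm.unlink()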

extremeheat added a commit that referenced this pull request Dec 13, 2023
Reverts change #103 as sys.stdout.fileno doesn't exist on Colab notebooks; also fix color outputs in notebooks
@mara004
Contributor Author

mara004 commented Sep 9, 2024

I think I finally found a way to use shared memory, via shm-typed-array (JS) and posix_ipc (Py).
The downside is that shm-typed-array doesn't support Windows.

multiprocessing.shared_memory didn't work because it isn't specified how it is implemented; it seems to be meant only for communication between Python processes, not with arbitrary other processes. posix_ipc solves this problem.

A missing piece is that I don't know how to create a canvas backed by an external buffer, so we currently have to copy into the shared memory segment. It seems the canvas library might not provide an API for this.
Still, it is already faster than .blobValueOf().

There are also some cleanup issues to be figured out (apparently, the shared memory is not destroyed on a keyboard interrupt, despite the try/finally wrapper).

import time
import mmap
import argparse
from pathlib import Path

# third-party
import PIL.Image
import javascript
import posix_ipc

# NOTE: canvas must be the build pdfjs is linked against, otherwise it will fail with a type error
pdfjs = javascript.require("pdfjs-dist")
libcanvas = javascript.require("canvas")
libshm = javascript.require("shm-typed-array")


def render_pdf(input, outdir, scale):
    
    pdf = pdfjs.getDocument(input).promise
    n_pages = pdf.numPages
    n_digits = len(str(n_pages))
    
    # first pass: collect the scaled page sizes so we can size the shared buffer
    sizes = []
    for i in range(n_pages):
        page = pdf.getPage(i+1)
        viewport = page.getViewport({"scale": scale})
        w, h = int(viewport.width), int(viewport.height)
        sizes.append( (w, h) )
    
    # pre-allocate one segment large enough for the largest page (4 bytes per pixel, BGRX)
    max_alloc = max(w*h for w, h in sizes) * 4
    print(f"Shared memory size {max_alloc}")
    
    memkey = "/pypdfjs_render_shm"
    js_shm = libshm.create(max_alloc, "Buffer", memkey)
    assert js_shm is not None, "Shared memory segment of this name already exists"
    
    try:
        # attach to the same segment from the Python side and map it into memory
        py_shm_handle = posix_ipc.SharedMemory(memkey)
        py_shm = mmap.mmap(py_shm_handle.fd, py_shm_handle.size)
        
        for i in range(n_pages):
            
            page = pdf.getPage(i+1)
            viewport = page.getViewport({"scale": scale})
            w, h = sizes[i]
            
            canvas = libcanvas.createCanvas(w, h)
            context = canvas.getContext("2d")
            page.render({"canvasContext": context, "viewport": viewport}).promise
            
            # copy the raw BGRX pixels into the shared segment; this extra copy is
            # needed because canvas cannot render into an external buffer directly
            js_buffer = canvas.toBuffer("raw")
            starttm = time.time()
            js_buffer.copy(js_shm)
            print(f"Data transfer took {time.time() - starttm}s")
            
            # construct a PIL image directly from the shared memory contents
            pil_image = PIL.Image.frombuffer("RGBX", (w, h), py_shm, "raw", "BGRX", 0, 1)
            py_shm.seek(0)
            pil_image.save(outdir / f"out_{i:0{n_digits}d}.jpg")
    finally:
        libshm.destroy(memkey)
        pdf.destroy()
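
argparse and Path are imported above, but the CLI entry point appears to have been left out of the paste; a plausible (hypothetical) one:

if __name__ == "__main__":
    # hypothetical entry point, not part of the original snippet
    parser = argparse.ArgumentParser(description="Render a PDF with pdfjs through shared memory")
    parser.add_argument("input")
    parser.add_argument("outdir", type=Path)
    parser.add_argument("--scale", type=float, default=2.0)
    args = parser.parse_args()
    render_pdf(args.input, args.outdir, args.scale)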
