
Allow to transfer binary data without serialization (CC #67) #103

Merged
merged 13 commits into extremeheat:master from sendblob on Nov 27, 2023

Conversation

mara004
Contributor

@mara004 mara004 commented Oct 15, 2023

This patch would allow transferring binary data (e.g. a Buffer) directly from JS to Python, without JSON serialization, using the existing stderr pipe mechanism (see #67).

Example of the impact:
Rendering 10 PDF pages at ~200 DPI BGRX and transferring them via the existing bytes(js_buffer.valueOf()["data"]) takes ~35s. The new js_buffer.blobValueOf() added here reduces this to ~3s. (Interestingly enough, tempfiles are actually a bit faster than blobValueOf().)
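For illustration, a minimal timing sketch of the two paths (assuming js_buffer is a Buffer proxy obtained through the bridge, and that blobValueOf() returns bytes):

import time

start = time.time()
data = bytes(js_buffer.valueOf()["data"])  # existing JSON-serialized path
print(f"valueOf: {time.time() - start:.2f}s")

start = time.time()
data = js_buffer.blobValueOf()  # raw transfer over the stderr pipe (this patch)
print(f"blobValueOf: {time.time() - start:.2f}s")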


That said, I'm rather uncertain about the implementation. Blobs can't be delimited with a simple separator (such as \n) but need to be passed with an explicit length, which breaks stderr_lines.

This patch switches to immediate parsing (rather than parsing deferred to readAll()) so we can fetch the remainder after the first \n and register a blob as a whole, instead of erroneously splitting it into separate lines. Is this OK? Any alternative ideas?

readAll() is now effectively a no-op that merely returns the prepared batch of items. I guess it should also be possible to feed the individual items into the queue directly?

@mara004 mara004 changed the title [exp] allow to transfer blobs without serialization (CC #67) [Experimental] Allow to transfer blobs without serialization (CC #67) Oct 15, 2023
@mara004 mara004 changed the title [Experimental] Allow to transfer blobs without serialization (CC #67) Allow to transfer blobs without serialization (CC #67) Oct 15, 2023
@mara004 mara004 changed the title Allow to transfer blobs without serialization (CC #67) Allow to transfer binary data without serialization (CC #67) Oct 15, 2023
@extremeheat
Owner

Thanks for working on this. I agree that sending binary/buffer data is going to be very slow at present. The data has to be encoded as UTF-8, JSON serialized, UTF-8 decoded, JSON parsed, and then converted into the desired data type. This is especially bad as JSON.stringify on a Buffer writes all the bytes as a comma-delimited ASCII array.

Ideally, length-prefixed binary data would be written directly over the stdio pipes without those transformations. However, that's a bit challenging as we use Python's .readline() stream reading API, which uses newlines to delimit data packets (and the binary data can itself contain newlines or null terminators). This will require some thought on the best way to optimize the communication, maybe even a refactor, so I'll have to think about it some more.
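For illustration, length-prefixed framing could look roughly like this (a sketch, not the library's actual protocol):

import struct

def write_frame(stream, payload: bytes):
    # 4-byte big-endian length prefix, then the raw payload
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)
    stream.flush()

def read_frame(stream) -> bytes:
    (length,) = struct.unpack(">I", stream.read(4))
    # read exactly the announced number of bytes; safe even if the
    # payload contains \n or \0
    return stream.read(length)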

One easy way I can think of to improve the performance, at least slightly, by serializing less data, is to send binary Buffer data Base64-encoded instead of as JSON. It would be great to have a script to benchmark the existing implementation vs. base64.
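A rough benchmark sketch of the two encodings (illustrative only):

import base64, json, os, time

data = os.urandom(10 * 1024 * 1024)  # 10 MiB of random bytes

t = time.time()
as_json = json.dumps(list(data))  # comma-delimited decimal array, similar to what JSON.stringify yields for a Buffer's data
print(f"json: {time.time() - t:.2f}s, {len(as_json)} chars")

t = time.time()
as_b64 = base64.b64encode(data).decode("ascii")  # only ~33% larger than the raw data
print(f"base64: {time.time() - t:.2f}s, {len(as_b64)} chars")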

@mara004
Contributor Author

mara004 commented Oct 17, 2023

Thanks for the response!

Ideally, length-prefixed binary data would be written directly over the stdio pipes without those transformations. However, that's a bit challenging as we use Python's .readline() stream reading API, which uses newlines to delimit data packets (and the binary data can itself contain newlines or null terminators). This will require some thought on the best way to optimize the communication, maybe even a refactor, so I'll have to think about it some more.

Yes, I've been wondering about that, too. The proposed patch basically handles this by parsing the server's responses immediately and giving the parser access to the stream, so it can fetch the data following the first \n when a length-annotated blob request comes in (see code passage). This seemed like a quick, minimally invasive fix.
However, I agree it may be a good idea to eventually revisit the overall IPC concept in light of blobs.
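In rough pseudocode, the idea is something like the following (the message keys and the deliver_* helpers are illustrative, not the actual patch):

import json

def on_stderr_line(stream, line: bytes):
    msg = json.loads(line)
    if msg.get("c") == "blob":
        # length-annotated blob: consume the announced number of raw bytes
        # directly from the stream instead of splitting them at newlines
        payload = stream.read(msg["len"])
        deliver_blob(msg["id"], payload)
    else:
        deliver_json(msg)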

One easy way I can think of to improve the performance, at least slightly, by serializing less data, is to send binary Buffer data Base64-encoded instead of as JSON.

Yes, I expect base64 would be better than a UTF-8 encoded, comma-delimited JSON array, but it would still imply an encode/decode layer and some size increase. This patch should yield a bigger performance gain for blob transfer than base64 alone.

@extremeheat
Owner

Sounds good, can you fix the conflict by merging master?

Adding a test to https://github.com/extremeheat/JSPyBridge/blob/master/test/javascript/test_general.py to make sure this works would be great, then we can merge this PR.

Similar functionality would probably make sense for Python -> JS, but that can be explored later.

@mara004
Contributor Author

mara004 commented Oct 24, 2023

Yes, I plan to do the rebase/test/docs (and maybe also add the pdfjs example?), but it might take some time until I can resume working on this.

You're right, something similar would be needed for Py to JS, but I haven't looked at that side of the communication yet, since it wasn't my immediate use case. But IIRC pdfjs also supports opening PDFs from a byte string, so we might get to that sooner or later as well.

@mara004 mara004 force-pushed the sendblob branch 10 times, most recently from 10ceb3f to e98823d on November 2, 2023 15:00
@mara004 mara004 marked this pull request as ready for review November 2, 2023 15:00
@mara004
Contributor Author

mara004 commented Nov 2, 2023

@extremeheat I believe this may be ready for review/merge.

Could you please check that the Jupyter notebook code passage still works after the change? I'm not familiar with this tool.

@mara004

This comment was marked as outdated.

@mara004 mara004 force-pushed the sendblob branch 2 times, most recently from 753ef4a to f753c9b on November 4, 2023 13:33
@mara004 mara004 force-pushed the sendblob branch 2 times, most recently from b06913e to caef781 on November 4, 2023 14:02
Sometimes it works, sometimes it doesn't. See below for examples:

Working:
https://cinelerra-gg.org/download/CinelerraGG_Manual.pdf
https://pikepdf.readthedocs.io/_/downloads/en/latest/pdf/

Failing:
https://www.gemeinde-grindelwald.ch/wp-content/uploads/2020/06/Broschuere_Landwirtschaft_und_Tourismus_Franzoesisch.pdf
-> getHexString reports invalid characters. The file renders fine when downloaded, though.

Further, the document below produces an out-of-memory error, which also occurs when rendering locally:
https://bro.jrtag.ch/Gri/so_guide/de_en/files/downloads/Infoguide_Sommer_23-WEB.pdf
@mara004
Contributor Author

mara004 commented Nov 5, 2023

Now a macOS test is failing, but I don't understand the error. It seems to be something with a mismatched eval_js() expectation. Is it really caused by this change? Could you run the test workflow on master to see if it fails there, too?

@extremeheat
Owner

Sorry for the delay here, I'll test this locally as soon as possible and merge if no problems arise. Thanks for the work on this :)

@mara004
Contributor Author

mara004 commented Nov 18, 2023

Thanks! In the meantime, I've pushed some final style touches.

I hope the Readme changes are OK. I tried to rephrase some things to make them easier to understand, but I'm somewhat afraid I might have gotten something wrong due to my limited knowledge of the JS/IPC subject.
There are still some phrases in that section I don't entirely understand...

@mara004
Contributor Author

mara004 commented Nov 18, 2023

When merging, would you please also squash the commits? Thanks!

- match test name letter case
- add print and 2 more test values
- use list comp
@extremeheat
Owner

Thanks for the work on this!

@extremeheat extremeheat merged commit 36a0c74 into extremeheat:master Nov 27, 2023
6 checks passed
@mara004
Contributor Author

mara004 commented Nov 27, 2023

Thanks for merging!

@mara004
Contributor Author

mara004 commented Nov 30, 2023

Would you mind triggering a new release for this?

@extremeheat
Owner

/makerelease

@import-bot import-bot mentioned this pull request Nov 30, 2023
@mara004
Contributor Author

mara004 commented Dec 9, 2023

Hmm, while this is certainly faster than before, piping data is still inherently rather slow.

For a sequence of discardable data items of static or similar size, I imagine shared memory would be a lot better, e.g. for use cases like streaming video from JS to Python. The same applies to PDFs: most have fairly homogeneous page sizes, so we could just pre-allocate a shared buffer the size of the largest page.

I don't know how expensive it is to allocate shared memory, or whether there are specific limits.
I'd be curious how a procedure of allocating new shared memory, copying in the data, and taking over the segment with Python would compare in performance to the current blobValueOf().

Is anyone aware of a cross-platform, maintained shared memory library in JS that could be connected with Python's multiprocessing.shared_memory module?
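For reference, the Python side alone is simple with the stdlib (a minimal sketch; the open question is a matching, maintained JS counterpart):

from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=4 * 1024 * 1024, name="frame_buf")
try:
    shm.buf[:5] = b"hello"  # shm.buf is a writable memoryview over the segment
finally:
    shm.close()
    shm.unlink()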

extremeheat added a commit that referenced this pull request Dec 13, 2023
Reverts change #103 as sys.stdout.fileno doesn't exist on Colab notebooks; also fix color outputs in notebooks
@mara004
Contributor Author

mara004 commented Sep 9, 2024

I think I finally found a way to use shared memory, via shm-typed-array (JS) and posix_ipc (Py).
The downside is that shm-typed-array doesn't support Windows.

multiprocessing.shared_memory didn't work because it isn't specified how it is implemented; it seems to be meant only for communication between Python processes, not with arbitrary other processes. posix_ipc solves this problem.

A missing piece is that I don't know how to create a canvas backed by an external buffer, so we currently have to copy into the shared memory segment. It seems the canvas library might not provide an API for this.
Still, it is already faster than .blobValueOf().

There are also some cleanup issues to be figured out (apparently, the shared memory is not destroyed on a keyboard interrupt, despite the try/finally wrapper).

import time
import mmap
import argparse
from pathlib import Path

# third-party
import PIL.Image
import javascript
import posix_ipc

# NOTE: canvas must be the build pdfjs is linked against, otherwise it will fail with a type error
pdfjs = javascript.require("pdfjs-dist")
libcanvas = javascript.require("canvas")
libshm = javascript.require("shm-typed-array")


def render_pdf(input, outdir, scale):
    
    pdf = pdfjs.getDocument(input).promise
    n_pages = pdf.numPages
    n_digits = len(str(n_pages))
    
    # first pass: collect the scaled page sizes so we can size the shared buffer
    sizes = []
    for i in range(n_pages):
        page = pdf.getPage(i+1)
        viewport = page.getViewport({"scale": scale})
        w, h = int(viewport.width), int(viewport.height)
        sizes.append( (w, h) )
    
    # pre-allocate one segment large enough for the largest page (4 bytes per pixel, BGRX)
    max_alloc = max(w*h for w, h in sizes) * 4
    print(f"Shared memory size {max_alloc}")
    
    memkey = "/pypdfjs_render_shm"
    js_shm = libshm.create(max_alloc, "Buffer", memkey)
    assert js_shm is not None, "Shared memory segment of this name already exists"
    
    try:
        # attach to the same segment from the Python side and map it into memory
        py_shm_handle = posix_ipc.SharedMemory(memkey)
        py_shm = mmap.mmap(py_shm_handle.fd, py_shm_handle.size)
        
        for i in range(n_pages):
            
            page = pdf.getPage(i+1)
            viewport = page.getViewport({"scale": scale})
            w, h = sizes[i]
            
            canvas = libcanvas.createCanvas(w, h)
            context = canvas.getContext("2d")
            page.render({"canvasContext": context, "viewport": viewport}).promise
            
            # copy the raw BGRX pixels into the shared segment; this extra copy is
            # needed because canvas cannot render into an external buffer directly
            js_buffer = canvas.toBuffer("raw")
            starttm = time.time()
            js_buffer.copy(js_shm)
            print(f"Data transfer took {time.time() - starttm}s")
            
            # construct a PIL image directly from the shared memory contents
            pil_image = PIL.Image.frombuffer("RGBX", (w, h), py_shm, "raw", "BGRX", 0, 1)
            py_shm.seek(0)
            pil_image.save(outdir / f"out_{i:0{n_digits}d}.jpg")
    finally:
        libshm.destroy(memkey)
        pdf.destroy()
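
argparse and Path are imported above, but the CLI entry point appears to have been left out of the paste; a plausible (hypothetical) one:

if __name__ == "__main__":
    # hypothetical entry point, not part of the original snippet
    parser = argparse.ArgumentParser(description="Render a PDF with pdfjs through shared memory")
    parser.add_argument("input")
    parser.add_argument("outdir", type=Path)
    parser.add_argument("--scale", type=float, default=2.0)
    args = parser.parse_args()
    render_pdf(args.input, args.outdir, args.scale)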
