Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDF writing fails with joblib.Parallel using (default) Loky backend #181

Open
leoschwarz opened this issue Aug 12, 2024 · 6 comments
Open
Labels
bug Something isn't working

Comments

@leoschwarz
Copy link

leoschwarz commented Aug 12, 2024

What happened?

Dear developers,
I'm not sure if this is well known, but the following code

import altair as alt
import joblib
import pandas as pd
import os

os.environ["RUST_BACKTRACE"] = "full"


def write_chart(filename):
    df = pd.DataFrame({"x": [2, 3, 4], "y": [5, 5, 3]})
    chart = alt.Chart(df).mark_point().encode(x="x", y="y")
    chart.save(filename)


filenames = [f"chart{i}.pdf" for i in range(2)]
joblib.Parallel(n_jobs=2)(joblib.delayed(write_chart)(filename) for filename in filenames)

results in an error (full message below).

I'm reporting it here since I triggered with altair and would be nice to address with a fix or documentation, but maybe the issue originates in another project and it is beyond the scope of this issue tracker. If you think this would better fit into the loky or deno tracker I'm happy to move it there.

/Users/leo/code/msi/.venv/bin/python /Users/leo/code/msi/code/prototyping/altair_minimal_pdf.py 
thread '<unnamed>' panicked at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deno_runtime-0.166.0/worker.rs:701:7:
Bootstrap exception: TypeError: testEnabled is not a function
    at init (ext:deno_node/internal/util/debuglog.ts:51:15)
    at debug (ext:deno_node/internal/util/debuglog.ts:54:5)
    at logger (ext:deno_node/internal/util/debuglog.ts:69:29)
    at readableAddChunk (ext:deno_node/_stream.mjs:2797:7)
    at Readable.push (ext:deno_node/_stream.mjs:2791:14)
    at initStdin (ext:deno_node/_process/streams.mjs:185:13)
    at Object.internals.__bootstrapNodeProcess (node:process:653:22)
    at initialize (ext:deno_node/02_init.js:34:15)
    at bootstrapMainRuntime (ext:runtime_main/js/99_main.js:873:7)
stack backtrace:
   0:        0x31089fa1c - _BrotliDecoderVersion
   1:        0x310098630 - _BrotliDecoderVersion
   2:        0x310875ef8 - _BrotliDecoderVersion
   3:        0x3108a20b8 - _BrotliDecoderVersion
   4:        0x3108a19d0 - _BrotliDecoderVersion
   5:        0x3108a15d8 - _BrotliDecoderVersion
   6:        0x3108a28e8 - _BrotliDecoderVersion
   7:        0x3108a23d0 - _BrotliDecoderVersion
   8:        0x3108a2338 - _BrotliDecoderVersion
   9:        0x3108a232c - _BrotliDecoderVersion
  10:        0x312068e9c - _wgpu_render_bundle_insert_debug_marker
  11:        0x310a71160 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  12:        0x310a6e184 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  13:        0x310a806c0 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  14:        0x3108a4e74 - _BrotliDecoderVersion
  15:        0x1937baf94 - __pthread_joiner_wake
thread '<unnamed>' panicked at /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/deno_runtime-0.166.0/worker.rs:701:7:
Bootstrap exception: TypeError: testEnabled is not a function
    at init (ext:deno_node/internal/util/debuglog.ts:51:15)
    at debug (ext:deno_node/internal/util/debuglog.ts:54:5)
    at logger (ext:deno_node/internal/util/debuglog.ts:69:29)
    at readableAddChunk (ext:deno_node/_stream.mjs:2797:7)
    at Readable.push (ext:deno_node/_stream.mjs:2791:14)
    at initStdin (ext:deno_node/_process/streams.mjs:185:13)
    at Object.internals.__bootstrapNodeProcess (node:process:653:22)
    at initialize (ext:deno_node/02_init.js:34:15)
    at bootstrapMainRuntime (ext:runtime_main/js/99_main.js:873:7)
stack backtrace:
   0:        0x14889fa1c - _BrotliDecoderVersion
   1:        0x148098630 - _BrotliDecoderVersion
   2:        0x148875ef8 - _BrotliDecoderVersion
   3:        0x1488a20b8 - _BrotliDecoderVersion
   4:        0x1488a19d0 - _BrotliDecoderVersion
   5:        0x1488a15d8 - _BrotliDecoderVersion
   6:        0x1488a28e8 - _BrotliDecoderVersion
   7:        0x1488a23d0 - _BrotliDecoderVersion
   8:        0x1488a2338 - _BrotliDecoderVersion
   9:        0x1488a232c - _BrotliDecoderVersion
  10:        0x14a068e9c - _wgpu_render_bundle_insert_debug_marker
  11:        0x148a71160 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  12:        0x148a6e184 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  13:        0x148a806c0 - _v8_inspector__V8InspectorClient__BASE__consoleAPIMessage
  14:        0x1488a4e74 - _BrotliDecoderVersion
  15:        0x1937baf94 - __pthread_joiner_wake
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
        ^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 598, in __call__
    return [func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 598, in <listcomp>
    return [func(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/code/prototyping/altair_minimal_pdf.py", line 12, in write_chart
    chart.save(filename)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/vegalite/v5/api.py", line 2086, in save
    save(**kwds)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/save.py", line 224, in save
    perform_save()
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/save.py", line 189, in perform_save
    mb_any = spec_to_mimebundle(
             ^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/mimebundle.py", line 134, in spec_to_mimebundle
    return _spec_to_mimebundle_with_engine(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/altair/utils/mimebundle.py", line 267, in _spec_to_mimebundle_with_engine
    pdf = vlc.vegalite_to_pdf(
          ^^^^^^^^^^^^^^^^^^^^
ValueError: Vega-Lite to PDF conversion failed:
Failed to retrieve conversion result: oneshot canceled
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/leo/code/msi/code/prototyping/altair_minimal_pdf.py", line 16, in <module>
    joblib.Parallel(n_jobs=2)(joblib.delayed(write_chart)(filename) for filename in filenames)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 2007, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 1650, in _get_outputs
    yield from self._retrieve()
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 1754, in _retrieve
    self._raise_error_fast()
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 1789, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 745, in get_result
    return self._return_or_raise()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leo/code/msi/.venv/lib/python3.11/site-packages/joblib/parallel.py", line 763, in _return_or_raise
    raise self._result
ValueError: Vega-Lite to PDF conversion failed:
Failed to retrieve conversion result: oneshot canceled

Process finished with exit code 1

What would you like to happen instead?

The loop should work without an error, which is the case if you set either of:

  • n_jobs=1
  • backend="multiprocessing" and add freeze_support()

Especially the latter is interesting and is what I am using as a workaround now.

Which version of Altair are you using?

5.4.0

@leoschwarz leoschwarz added the bug Something isn't working label Aug 12, 2024
@jonmmease jonmmease transferred this issue from vega/altair Aug 12, 2024
@jonmmease
Copy link
Collaborator

Hi @leoschwarz, thanks for the report. I've transferred this over to the vl-convert repo which implements the image export logic.

The vl-convert bundles the Deno JavaScript runtime which only supports running on a single thread, but my understanding is that the loky backend uses separate processes, so I'm not certain that's the issue. Do you have the same issue using the multiprocessing API?

@leoschwarz
Copy link
Author

Thank you for transferring the issue. With multiprocessing I cannot reproduce this issue, i.e. the following works:

from multiprocessing import freeze_support

import altair as alt
import joblib
import os
import pandas as pd

os.environ["RUST_BACKTRACE"] = "full"


def write_chart(filename):
    df = pd.DataFrame({"x": [2, 3, 4], "y": [5, 5, 3]})
    chart = alt.Chart(df).mark_point().encode(x="x", y="y")
    chart.save(filename)


if __name__ == "__main__":
    freeze_support()
    filenames = [f"chart{i}.pdf" for i in range(2)]
    joblib.Parallel(n_jobs=10, backend="multiprocessing")(
        joblib.delayed(write_chart)(filename) for filename in filenames)

Taken from the loky README:

"All processes are started using fork + exec on POSIX systems. This ensures safer interactions with third party libraries. On the contrary, multiprocessing.Pool uses fork without exec by default, causing third party runtimes to crash (e.g. OpenMP, macOS Accelerate...)."

So my understanding is they use different fork models, but in this case the default multiprocessing seems to work whereas loky does not. I'm not really an expert on the details of multiprocessing to understand how this relates to Deno's runtime.

@jonmmease
Copy link
Collaborator

Thanks for the investigation @leoschwarz. Documentation is probably the best first step, just to let people know that the multiprocessing backend works but loky backend does not.

One thing that we might be able to do is expose an alternative API that doesn't rely on a global instance of the Rust object that wraps Deno. We might be able to expose a VlConverter() class that wraps a dedicated instance of the Deno, that has methods for each of the global vl_convert.* functions. The hope would be that if you create and use this from within the forked process, then everything will work fine since the global Deno instance wouldn't be forked.

@leoschwarz
Copy link
Author

I'm not sure if that would fully resolve the problem yet, because my workflow is basically joblib distributing tasks which execute the plotting within a new subprocess each starting its own Python interpreter (largely to avoid this type of problem), so I suspect the problem is located in a native extension doing something unusual with memory somewhere. I'm looking into creating a better example for this.

@leoschwarz
Copy link
Author

So I've finally gotten around to trace this a bit further since it still is a problem for me.

import pandas as pd
import altair as alt
import sys
import os
import threading
import argparse

def write_chart(filename):
    df = pd.DataFrame({"x": [2, 3, 4], "y": [5, 5, 3]})
    chart = alt.Chart(df).mark_line().encode(x="x", y="y")
    chart.save(filename)


def run_with_thread():
    thread = threading.Thread(target=write_chart, args=("chart_thread.pdf",))
    thread.start()
    thread.join()

def run_with_fork():
    if os.fork() == 0:
        write_chart("chart_fork.pdf")
        sys.exit(0)
    os.wait()

parser = argparse.ArgumentParser()
parser.add_argument("variant", choices=["thread", "fork", "thread-fork", "fork-thread"])
args = parser.parse_args()

if args.variant == "thread":
    run_with_thread()
elif args.variant == "fork":
    run_with_fork()
elif args.variant == "thread-fork":
    run_with_thread()
    run_with_fork()
elif args.variant == "fork-thread":
    run_with_fork()
    run_with_thread()

Every variant except "thread-fork" works as expected.
The error is different this time, but I believe it's the same underlying issue that leads to it as in my experience with joblib Loky, as the latter uses forks too.

thread '<unnamed>' panicked at /Users/leo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/io/driver.rs:209:27:
failed to wake I/O driver: Os { code: 9, kind: Uncategorized, message: "Bad file descriptor" }

New Python versions also give us the following warning message:

DeprecationWarning: This process (pid=78279) is multi-threaded, use of fork() may lead to deadlocks in the child.

From what I understand, when a fork is created only the main thread will be present in the fork, so if in deno there is somehow some pointers to some existing threads, then this information will be lost in the forked process which leads to issues. I'm not sure if there is some easy way to avoid this type of problem...

My personal take away is to stop using software that calls os.fork in Python, as we will have proper threading soon. For this we can use subprocessing directly making sure to use the spawn instead of fork context (the default is different on Mac and Linux which can make this a bit more confusing if you are developing on both platforms)...


Initially I created this test case which someone might find handy later

import pytest
import joblib
import pandas as pd
import altair as alt
import sys
import time
from multiprocessing import freeze_support
import os


def write_chart(filename):
    df = pd.DataFrame({"x": [2, 3, 4], "y": [5, 5, 3]})
    chart = alt.Chart(df).mark_point().encode(x="x", y="y")
    chart.save(filename)


@pytest.mark.parametrize("backend", ["multiprocessing", "threading", "loky", None])
@pytest.mark.parametrize("ext", ["png", "pdf"])
def test_joblib(tmpdir, backend, ext):
    filenames = [str(tmpdir/f"chart{i}.{ext}") for i in range(2)]
    if backend:
        freeze_support()
        joblib.Parallel(n_jobs=2, backend=backend)(joblib.delayed(write_chart)(filename) for filename in filenames)
    else:
        for filename in filenames:
            write_chart(filename)

The good news is that it works with joblib's other backends.

@jonmmease
Copy link
Collaborator

Thanks for the detailed writeup. If you have a chance, it would be great to add a short summary to the Limitations section of the README.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants