Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Python GIL lock handling (Fixes #6524, Fixes #5631) #6525

Merged
merged 4 commits into from
Jan 4, 2022

Conversation

LebedevRI
Copy link
Contributor

As disscussed in #6523 (comment)
and later in #6524,
pybind11 v2.8.1 added some defensive checks that fail for halide,
namely in python_tutorial_lesson_04_debugging_2
and python_tutorial_lesson_05_scheduling_1.

#6524 (comment) notes:

  • Python calls a Halide-JIT-generated function , which runs with the GIL held.
  • Halide runtime spawns worker threads.
  • The worker threads try to call pybind11's py::print function to emit traces.
  • Pybind11 complains, correctly, that the worker thread doesn't hold the GIL.

Trying to acquire the GIL hangs, because the main thread is still holding it. I tried teaching the main thread to release the GIL (as suggested in #5631), but I still saw hangs when I tried this.

I have tried, and just dropping the lock before calling into halide,
or just acquiring it in halide_python_print doesn't work,
we need to do both.

I have verified that the two tests fail without this fix,
and pass with it.

As disscussed in halide#6523 (comment)
and later in halide#6524,
pybind11 v2.8.1 added some defensive checks that fail for halide,
namely in `python_tutorial_lesson_04_debugging_2`
and `python_tutorial_lesson_05_scheduling_1`.

halide#6524 (comment) notes:
> * Python calls a Halide-JIT-generated function , which runs with the GIL held.
> * Halide runtime spawns worker threads.
> * The worker threads try to call pybind11's py::print function to emit traces.
> * Pybind11 complains, correctly, that the worker thread doesn't hold the GIL.
>
> Trying to acquire the GIL hangs, because the main thread is still holding it. I tried teaching the main thread to release the GIL (as suggested in halide#5631), but I still saw hangs when I tried this.

I have tried, and just dropping the lock before calling into halide,
or just acquiring it in `halide_python_print` doesn't work,
we need to do both.

I have verified that the two tests fail without this fix,
and pass with it.
@Infinoid
Copy link
Contributor

Infinoid commented Jan 2, 2022

I verified this patch fixes the failures I was seeing, and doesn't add any new ones.

100% tests passed, 0 tests failed out of 614

Thanks!

@LebedevRI
Copy link
Contributor Author

LebedevRI commented Jan 2, 2022

I verified this patch fixes the failures I was seeing, and doesn't add any new ones.

100% tests passed, 0 tests failed out of 614

Thanks!

Theoretically, we need to drop the lock every time we call halide API from python,
and take the lock every time we call python api from halide.

I only handled realize since that is the entry that caused the given crashes,
and only dealt with halide_python_print, since again that is where the reproduced problem was,
while e.g. HalidePythonCompileTimeErrorReporter is still "wrong".

So i very much expect that this adds more brokenness.

@Infinoid
Copy link
Contributor

Infinoid commented Jan 2, 2022

Theoretically, we need to drop the lock every time we call halide API from python,
and take the lock every time we call python api from halide.

For calls from Python into Halide that don't spawn threads, the easiest thing to do is leave the lock alone. Then it's fine if Halide calls back into Python, because it still holds the lock.

Other cases can be reviewed on a case-by-case basis.

@LebedevRI
Copy link
Contributor Author

Other cases can be reviewed on a case-by-case basis.

That's why i posted this yes, clearly if no one cared to fix this problem,
they won't mind having more potential problems.

@Infinoid
Copy link
Contributor

Infinoid commented Jan 2, 2022

Well, my point was, no threads == no problem. I don't think that pipeline definitions, scheduling, AOT compilation, target definitions or buffer accesses spawn threads.

The only entry points I can think of that do spawn threads are Func.realize(), Pipeline.realize(), and calling an AOT-compiled function, and I think this patch covers those cases. But I'm not an expert.

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about the GIL to be the ideal reviewer here, but I do think that we should update python_bindings/CMakeLists.txt to specify (at least) v2.8.1 for the version we pull for building. (Maybe 2.9, since that is apparently the most recent release?). If you update this PR the buildbots should test using the version specified.

@LebedevRI
Copy link
Contributor Author

I don't know enough about the GIL to be the ideal reviewer here, but I do think that we should update python_bindings/CMakeLists.txt to specify (at least) v2.8.1 for the version we pull for building. (Maybe 2.9, since that is apparently the most recent release?). If you update this PR the buildbots should test using the version specified.

Err, i do not follow. Why do you want to bump the required pybind version?
Just because version A is required, doesn't mean that buildbots can't test version A+1.

@steven-johnson
Copy link
Contributor

Err, i do not follow. Why do you want to bump the required pybind version?

Because our version is out of date and we like to keep close to the latest release unless theres a reason not t.

Just because version A is required, doesn't mean that buildbots can't test version A+1.

True, but there isn't any mechanism for doing so right now -- we'd have to add one.

@LebedevRI
Copy link
Contributor Author

LebedevRI commented Jan 3, 2022

Err, i do not follow. Why do you want to bump the required pybind version?

Because our version is out of date and we like to keep close to the latest release unless theres a reason not t.

I see, but then this is a separate concern from what we have at hand.

Roughly, with caveats, depending on the most newest version of everything makes it hard while not impossible
to use ${PROJ} on anything but the bleeding edge distributions.

Case in point, if you bump it to v2.9, then i'm simply going to abandon this patch,
because i will no longer be able to build it in debian unstable,
and if you push just to v2.8, then it will become unbuildable in debian stable,
which will create additional problems for the original reporter, @Infinoid.

Just because version A is required, doesn't mean that buildbots can't test version A+1.

True, but there isn't any mechanism for doing so right now -- we'd have to add one.

@steven-johnson
Copy link
Contributor

Case in point, if you bump it to v2.9, then i'm simply going to abandon this patch, because i will no longer be able to build it in debian unstable

Not true for the CMake build, which always pulls and builds a specific, captive version, ignoring what is present on the system. (If this isn't what's happening, it's a bug in our CMake rules.) Are you using Make?

@LebedevRI
Copy link
Contributor Author

Case in point, if you bump it to v2.9, then i'm simply going to abandon this patch, because i will no longer be able to build it in debian unstable

Not true for the CMake build, which always pulls and builds a specific, captive version, ignoring what is present on the system. (If this isn't what's happening, it's a bug in our CMake rules.) Are you using Make?

Hang on, we are talking about different things here, aren't we :)
I'm talking about the find_package() stuff, NOT FetchContent.
I don't have an opinion on what version of pybind you should bundle,
and i can bump that to v2.9 if you want,
but i do have an opinion on the required pybind version,
which i don't want to bump.

@steven-johnson
Copy link
Contributor

ah, ok, in that case I defer entirely too @alexreinking :-)

@LebedevRI
Copy link
Contributor Author

ah, ok, in that case I defer entirely too @alexreinking :-)

Bumped the bundled version, CI seems to be happy with it :)

@@ -4,13 +4,14 @@

find_package(Python3 REQUIRED COMPONENTS Interpreter Development)

set(PYBIND11_VER 2.6.2)
find_package(pybind11 ${PYBIND11_VER} QUIET)
set(PYBIND11_MIN_SUPPORTED_VER 2.6.2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I wonder why we don't always FetchContent for this, so we always get a known version; IIUC, PyBind11 is 100% a header-only library, with zero runtime dependencies, and it isn't that large.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downloading stuff during cmake/make time is a deal-breaker for packaging, for the record.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -4,13 +4,14 @@

find_package(Python3 REQUIRED COMPONENTS Interpreter Development)

set(PYBIND11_VER 2.6.2)
find_package(pybind11 ${PYBIND11_VER} QUIET)
set(PYBIND11_MIN_SUPPORTED_VER 2.6.2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python_bindings/CMakeLists.txt Outdated Show resolved Hide resolved
Copy link
Member

@alexreinking alexreinking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #6531 to record that we need to test newer pybind11, too. With that backed out, this LGTM.

@alexreinking alexreinking added the backport me This change should be backported to release versions label Jan 4, 2022
@alexreinking
Copy link
Member

Failure is a performance test on llvm14 that does not use Python... cannot possibly be related.

@alexreinking alexreinking merged commit b8eb22d into halide:master Jan 4, 2022
@LebedevRI LebedevRI deleted the python-gil-lock branch January 4, 2022 21:56
@LebedevRI
Copy link
Contributor Author

@alexreinking thank you!

alexreinking pushed a commit that referenced this pull request Jan 5, 2022
* Fix Python GIL lock handling (Fixes #6524, Fixes #5631)

As disscussed in #6523 (comment)
and later in #6524,
pybind11 v2.8.1 added some defensive checks that fail for halide,
namely in `python_tutorial_lesson_04_debugging_2`
and `python_tutorial_lesson_05_scheduling_1`.

#6524 (comment) notes:
> * Python calls a Halide-JIT-generated function , which runs with the GIL held.
> * Halide runtime spawns worker threads.
> * The worker threads try to call pybind11's py::print function to emit traces.
> * Pybind11 complains, correctly, that the worker thread doesn't hold the GIL.
>
> Trying to acquire the GIL hangs, because the main thread is still holding it. I tried teaching the main thread to release the GIL (as suggested in #5631), but I still saw hangs when I tried this.

I have tried, and just dropping the lock before calling into halide,
or just acquiring it in `halide_python_print` doesn't work,
we need to do both.

I have verified that the two tests fail without this fix,
and pass with it.

(cherry picked from commit b8eb22d)
@alexreinking alexreinking removed the backport me This change should be backported to release versions label Jan 5, 2022
alexreinking pushed a commit that referenced this pull request Jan 6, 2022
* Fix Python GIL lock handling (Fixes #6524, Fixes #5631)

As disscussed in #6523 (comment)
and later in #6524,
pybind11 v2.8.1 added some defensive checks that fail for halide,
namely in `python_tutorial_lesson_04_debugging_2`
and `python_tutorial_lesson_05_scheduling_1`.

#6524 (comment) notes:
> * Python calls a Halide-JIT-generated function , which runs with the GIL held.
> * Halide runtime spawns worker threads.
> * The worker threads try to call pybind11's py::print function to emit traces.
> * Pybind11 complains, correctly, that the worker thread doesn't hold the GIL.
>
> Trying to acquire the GIL hangs, because the main thread is still holding it. I tried teaching the main thread to release the GIL (as suggested in #5631), but I still saw hangs when I tried this.

I have tried, and just dropping the lock before calling into halide,
or just acquiring it in `halide_python_print` doesn't work,
we need to do both.

I have verified that the two tests fail without this fix,
and pass with it.

(cherry picked from commit b8eb22d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants