Disable numba caching via environment variable #869

Closed
timothymillar opened this issue Jul 3, 2022 · 4 comments · Fixed by #996

timothymillar commented Jul 3, 2022

Edit: related to #371

I've recently started experimenting with sgkit on a SLURM cluster, which is working well with the exception of methods that use guvectorize with cache=True. Calling these functions results in a segmentation fault on the worker. This only seems to be an issue with guvectorize (not the jit or vectorize decorators), and there is no segmentation fault if I set cache=False.

There are a couple of open issues that may be related, although neither quite matches what I'm seeing (I need to dig some more).

There is also an open issue for globally disabling numba caching, which would provide a workaround, although it might be stale.

In the meantime, for the sake of debugging and workarounds, it would be useful to be able to disable numba caching in sgkit via an environment variable.
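As a sketch of what such a switch could look like (the variable name SGKIT_DISABLE_NUMBA_CACHE matches the one discussed later in this thread; the helper itself is hypothetical, not sgkit's actual code):

```python
import os

def cache_enabled(environ=os.environ) -> bool:
    """Return False when SGKIT_DISABLE_NUMBA_CACHE is set to "1".

    The result would be forwarded as the ``cache=`` argument to numba's
    decorators, e.g. numba.guvectorize(sigs, layout, cache=cache_enabled()).
    """
    return environ.get("SGKIT_DISABLE_NUMBA_CACHE", "0") != "1"

print(cache_enabled({}))                                  # True (caching on by default)
print(cache_enabled({"SGKIT_DISABLE_NUMBA_CACHE": "1"}))  # False (caching disabled)
```

Taking the environment as a parameter keeps the helper trivially testable without mutating the real process environment.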


tomwhite commented Aug 2, 2022

Can this be closed now that #870 is in?

@timothymillar

Maybe we should leave it open for now to document the SGKIT_DISABLE_NUMBA_CACHE variable. I also wondered if you had a suggestion for testing in CI that setting the variable works as expected?
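One way to exercise this in CI (a self-contained sketch: since a flag like this is typically read at import time, the test sets the variable and then loads or reloads the module; fake_sgkit_flags is a throwaway stand-in written to disk here, not a real sgkit module):

```python
import importlib
import os
import sys
import tempfile
import textwrap

# Stand-in for a module that evaluates the flag at import time.
mod_src = textwrap.dedent("""
    import os
    CACHE = os.environ.get("SGKIT_DISABLE_NUMBA_CACHE", "0") != "1"
""")

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "fake_sgkit_flags.py"), "w") as f:
    f.write(mod_src)
sys.path.insert(0, tmpdir)

# With the variable set, caching should be reported as disabled ...
os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "1"
import fake_sgkit_flags
assert fake_sgkit_flags.CACHE is False

# ... and reloading after resetting it flips the flag back on.
os.environ["SGKIT_DISABLE_NUMBA_CACHE"] = "0"
importlib.reload(fake_sgkit_flags)
assert fake_sgkit_flags.CACHE is True
```

In a real pytest suite one would more likely use monkeypatch.setenv plus importlib.reload, but the shape of the check is the same.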

@benjeffery

I've hit this via #1051. Interestingly, I get a different error (TypeError: can not serialize 'numpy.int64' object) if I disable task fusion in dask (dask.config.set({"optimization.fuse.active": False})).


benjeffery commented Mar 9, 2023

After much digging I have discovered some interesting things about these segfaults.
As above, turning off dask task fusion results in the serialization error. Digging into the code, this is because we are passing numpy.int64 to some dask methods instead of int. For example, if I change:

```python
@wraps(gufunc)
def func(x: ArrayLike, cohort: ArrayLike, n: int, axis: int = -1) -> ArrayLike:
    x = da.swapaxes(da.asarray(x), axis, -1)
```

(from cohort_numba_fns.py) to:

```python
@wraps(gufunc)
def func(x: ArrayLike, cohort: ArrayLike, n: int, axis: int = -1) -> ArrayLike:
    n = int(n)
    axis = int(axis)
    x = da.swapaxes(da.asarray(x), axis, -1)
```

Then the serialisation error is fixed!

BUT if I then turn dask task fusion back on, the segfault is gone! So I think that in the fused task a compiled func is expecting an int but getting a numpy.int64, and then segfaulting?
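The mismatch is easy to reproduce outside dask; for example, json's serializer rejects numpy scalar ints in the same way until they are coerced (a minimal stand-in for the dask serialization path, not the actual code involved):

```python
import json
import numpy as np

n = np.int64(5)
print(isinstance(n, int))  # False: numpy scalar ints are not Python ints

try:
    json.dumps({"n": n})  # rejected, like dask's serializer above
except TypeError as e:
    print("serialization fails:", e)

print(json.dumps({"n": int(n)}))  # coercing with int() fixes it
```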

There are other segfaults still happening - I assume they are due to similar issues.

(@jeromekelleher numpy ints strike again!)
