[Bug] [c++/python] Querying DataFrame with pyarrow array coords in multiprocessing returns invalid result #1456
Comments
@atolopko-czi assigning to you for now pending the initial repro/etc writeup (we can re-assign from there)
Still occurs even if we use a single child process in a serial manner with:
Wondering if there's something in a C++ layer that is getting an Arrow array but is forgetting to look at the offset within the array.
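To make the offset hypothesis above concrete, here is a minimal toy model in plain Python. It is not tiledbsoma or pyarrow code: `ArrayView`, `read_values_correct`, and `read_values_ignoring_offset` are hypothetical names invented for illustration. The sketch assumes only that a zero-copy slice shares its parent's buffer and records a start offset (in pyarrow this offset is exposed as `Array.offset`), so a consumer that ignores the offset would always see the first slice's values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ArrayView:
    """Toy stand-in for an Arrow array: a shared buffer plus offset and length."""
    buffer: List[int]
    offset: int
    length: int

    def slice(self, start: int, length: int) -> "ArrayView":
        # Zero-copy slice: same underlying buffer, shifted offset.
        return ArrayView(self.buffer, self.offset + start, length)


def read_values_correct(view: ArrayView) -> List[int]:
    # Honors the offset, so each slice yields its own values.
    return view.buffer[view.offset : view.offset + view.length]


def read_values_ignoring_offset(view: ArrayView) -> List[int]:
    # Hypothetical bug: reads from the start of the shared buffer.
    return view.buffer[: view.length]


coords = ArrayView(list(range(100, 108)), offset=0, length=8)
first, second = coords.slice(0, 4), coords.slice(4, 4)

print(read_values_correct(second))          # [104, 105, 106, 107]
print(read_values_ignoring_offset(second))  # [100, 101, 102, 103]
```

In this model the buggy reader returns the first slice's values for every slice, which matches the symptom reported in this issue. Whether the actual C++ layer behaves this way is only the hypothesis raised in the comment, not something this sketch confirms.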
Hi @nguyenv -- re sequencing -- for this coming sprint #866 is of course priority 1 -- @atolopko-czi and I believe that this issue would be priority 2, and #1256 #1257 et al. ("Create ...") thereafter
Describe the bug
If the DataFrame.read() method is called in multiple child processes, where the coords arg for each child process is a disjoint slice of a pyarrow.Array object, then the result of the read() in each child process is the same, returning the first process's slice of the array in all cases. This does not occur if the slice is first converted to a Python list.
To Reproduce
Output:
Versions (please complete the following information):
Additional context
Issue was originally encountered in the CELLxGENE Census API PyTorch feature: chanzuckerberg/cellxgene-census#516 (see the PR discussion).