-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-33856: [C#] Implement C Data Interface for C# #35496
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution, @CurtHagenlocher!
I got some time tonight to take a swipe through and posted some comments here. I haven't looked deep into the Memory changes yet. I will when I get time.
try | ||
{ | ||
ConvertArray(allocationOwner, array.Data, cArray); | ||
cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)Marshal.GetFunctionPointerForDelegate<ReleaseArrowArray>(ReleaseArray); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do this another way. If you add [UnmanagedCallersOnly]
to the private unsafe static void ReleaseArray(CArrowArray* cArray)
method, then this line should just be:
cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)Marshal.GetFunctionPointerForDelegate<ReleaseArrowArray>(ReleaseArray); | |
cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)&ReleaseArray; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This applies for all the function pointers we need to set on these structs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would still work correctly if an array were exported via the C API to another bit of managed code which consumed it (via the C API)?
Edited: the bigger problem is that I would still like to support netstandard20, where UnmanagedCallersOnly isn't available. Is it worth using conditional compilation to optimize for .NET 5+?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would still work correctly if an array were exported via the C API to another bit of managed code which consumed it (via the C API)?
Yes because it is the release
pointer is an unmanaged function pointer, and not a managed delegate. It is a bit hard to explain. But even though the caller is "managed" (i.e. .NET code), the way the method is invoked is through an unmanged function pointer. So it is still an "UnmanagedCaller".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth using conditional compilation to optimize for .NET 5+?
Maybe trying writing a benchmark test in https://github.com/apache/arrow/tree/main/csharp/test/Apache.Arrow.Benchmarks to compare the difference? If we see major differences, it may make sense to split the compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking at this deeper, the current code may have problems since the managed Delegate might be GC'd.
You must manually keep the delegate from being collected by the garbage collector from managed code. The garbage collector does not track references to unmanaged code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you'd think I'd have remembered this given that I pointed it out in a private email last month :/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've taken care of the lifetime issues and would probably file a work item to test for optimization.
lock (this) | ||
bool IOwnableAllocation.TryAcquire(out IntPtr ptr, out int offset, out int length) | ||
{ | ||
// TODO: implement refcounted buffers? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this get addressed in this PR? If not, we should open an issue for it.
Current test failure is:
Need to update: arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs Lines 162 to 176 in 3948c42
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed for high level functionality and confirmed unit tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor nits but this looks very well thought out to me.
} | ||
} | ||
|
||
return new ValueTask<RecordBatch>(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could, potentially, accept a task scheduler and schedule a new task to call get_next
, allowing this to be more accurately async. Though that should definitely be a follow-up and probably depends on what you're interfacing with (e.g. is the underlying stream performing I/O and slow?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @CurtHagenlocher! This is great!
if (cArray->release != null) | ||
{ | ||
throw new ArgumentException("Cannot export array to a struct that is already initialized.", nameof(cArray)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't mandate this, since the user can call this with a local uninitialized struct ArrowArray
variable (if called from raw C rather than, say, Python).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was following the existing pattern in CArrowSchemaExporter. But I also think that the documentation, in saying that "A released structure is indicated by setting its release callback to NULL. Before reading and interpreting a structure’s data, consumers SHOULD check for a NULL release callback and treat it accordingly (probably by erroring out)." suggests that this check is appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then CArrowSchemaExporter
should probably be modified as well.
This is producer code, not consumer code. At this point, the structure is still uninitialized. "Uninitialized" in the C (or C++) sense, that is "may contain any arbitrary bytes", not "zero-initialized".
Producer code therefore shouldn't care about what is already in the structure.
cc @paleolimbot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of why C is awful ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed as part of #35996.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry I missed this and I see that it's been solved. In nanoarrow we definitely assume that pointer output arguments point to uninitialized memory (and strive to not touch that memory until failure is impossible). There are a few places where we do something like
struct ArrowArray tmp;
tmp.release = NULL;
// stuff with tmp that might fail
if (had_error) {
if (tmp.release != NULL) {
tmp.release(&tmp);
}
return;
}
ArrowArrayMove(&tmp, out);
return;
...to simplify (to the extent that anything in C is simple) the error handling.
Benchmark runs are scheduled for baseline = 41ba4fe and contender = 0dca449. 0dca449 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
} | ||
|
||
RecordBatch result = null; | ||
CArrowArray* cArray = CArrowArray.Create(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @CurtHagenlocher, thanks for implementing this! I've started testing using this to support reading Parquet data as Arrow record batches in ParquetSharp.
One concern I've found is that it appears that an imported stream will leak memory as these CArrowArray
instances are allocated for each batch but they're never freed, as ImportedArrowArray.FinalRelease
will call the release
callback but never deallocates the CArrowArray
struct itself.
Is that correct or am I missing something here?
Would it make sense for the imported array to take ownership of the CArrowArray
struct and deallocate it after calling release?
This seems to be an issue for external use of the C Data Interface API too, eg. if I want to return an IArrowArrayStream
from a method I can't free the CArrowArrayStream
that was used to import it until after the user is finished with the stream, which is awkward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the first issue has already been discovered and there's a PR out for fixing it: #35810
Give me a few minutes to absorb the second issue.
Derp... misread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, my pattern matching was a bit too eager for the first problem. Yes, that looks like a leak and I think your suggestion is right; that on this code path the ImportedArrowArray needs to remember that it owns the allocation.
The second problem feels different, because in the first case we always know that we were the ones who allocated the CArrowArray but I'm not entirely sure we know that about the CArrowArrayStream. It's true for the stream we got from Python in the test case, but that was using a pyarrow-specific API. The flavor of the C API as a whole does suggest that it will usually be the case for the caller to have to allocate the CArrowArrayStream without (I think) quite making it explicit.
On the whole, I suspect that both importers should take a flag which says whether or not to deallocate the structure afterwards. I'm not convinced about the right default for the flag given the relative risks of leaking memory vs deallocating it inappopriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #35988. Please amend it if I've misunderstood the problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, yeah having a flag to say whether the structure should be deallocated would solve both problems if it was public and implemented for both the array and stream. I think keeping the current behaviour of not deallocating makes sense as a default, given a small memory leak is less of an issue than incorrectly deallocating, and then that keeps allocation and deallocation symmetric by default.
It wouldn't be a big deal if only the first problem was solved though, I can solve the second problem with some extra indirection by introducing a wrapper for the imported stream that deallocates the struct on dispose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That solution of moving the struct into the imported array or stream is much nicer
Rationale for this change
This continues implementing the C Data Interface for C# with integration for
ArrowArray
,RecordBatch
and streams.What changes are included in this PR?
CArrowArray
andCArrowStream
to represent the C API structures.IArrowArrayStream
to represent an array stream or record batch reader.CArrowArrayImporter
,CArrowArrayExporter
,CArrowArrayStreamImporter
andCArrowArrayExporter
to marshal between C# and C representations.Are these changes tested?
Yes. Testing is largely done via the Python C API interface.
Are there any user-facing changes?
Yes, this adds new user-facing APIs to import and export C# structures using the C API.
This PR includes breaking changes to public APIs.
The default time unit for Time64Type was previously milliseconds. This does not appear to be valid, so it has been changed to nanoseconds.