GH-35809: [C#] Improvements to the C Data Interface #35810
This patches a memory leak.
The last error message is included if it exists, and in `ReadNextRecordBatchAsync` the exception was returned with the `ValueTask`.
There was a reason they were skipped, as explained in the PR that added them.
Could you write it in the pull request description? We use squash merge, and the merge commit only uses the pull request description. It seems that this pull request contains several logical changes. It may be better to split this pull request and its associated issue into multiple small pull requests/issues for ease of review.
using System.Runtime.InteropServices;
using Apache.Arrow.Memory;

namespace Apache.Arrow.C
{
    public static class CArrowArrayExporter
    {
#if NET5_0_OR_GREATER
Perhaps add a comment explaining why this needs .Net 5.0+? Or will it be obvious to a C# developer?
`UnmanagedCallersOnlyAttribute` was introduced in .NET 5.
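For readers less familiar with the attribute, here is a minimal sketch of why the exporter is split on `NET5_0_OR_GREATER`. The names `CallbackSketch`, `AddOne`, and `GetCallback` are hypothetical and do not come from Apache.Arrow. The delegate-based fallback and its `Cdecl` convention are assumptions chosen for illustration, not necessarily what the PR does on older targets.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class CallbackSketch
{
#if NET5_0_OR_GREATER
    // .NET 5+: the method itself can be handed to native code, no delegate needed.
    [UnmanagedCallersOnly]
    private static int AddOne(int x) => x + 1;

    public static IntPtr GetCallback()
    {
        delegate* unmanaged<int, int> fn = &AddOne;
        return (IntPtr)fn;
    }
#else
    // Older targets: marshal a delegate instead (illustrative fallback).
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate int AddOneDelegate(int x);

    private static int AddOne(int x) => x + 1;

    // Stored in a static field so the GC does not collect the delegate
    // while native code may still hold the function pointer.
    private static readonly AddOneDelegate s_addOne = AddOne;

    public static IntPtr GetCallback() =>
        Marshal.GetFunctionPointerForDelegate(s_addOne);
#endif

    public static void Main()
    {
        IntPtr fn = GetCallback();
        Console.WriteLine(fn != IntPtr.Zero); // True
    }
}
```

Both branches produce a raw pointer that can be stored in a C struct field. The code needs `AllowUnsafeBlocks` enabled to compile.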
@@ -203,6 +207,10 @@ private unsafe static void ReleaseArray(CArrowArray* cArray)
private unsafe static void Dispose(void** ptr)
{
    GCHandle gch = GCHandle.FromIntPtr((IntPtr)(*ptr));
    if (!gch.IsAllocated)
Add a comment explaining when this might occur? (perhaps if an exception occurs while exporting the array?)
@@ -184,13 +189,12 @@ private unsafe static void ConvertRecordBatch(ExportedAllocationOwner sharedOwne
    cArray->dictionary = null;
}

#if NET5_0_OR_GREATER
[UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvStdcall) })]
Shouldn't the default calling convention be used instead? I'm not sure `stdcall` is ok on non-Windows platforms.
See #34133 (comment). Ideally we would have used the default calling convention, but that would not be supported on anything earlier than .NET 5. And on 64-bit platforms the calling convention doesn't matter either way.
Well, the entire point (mostly) of implementing the C Data Interface is to be compatible with non-.Net producers/consumers. Those are extremely likely to use the platform default. So we should get it right at least when possible, i.e. on .Net >= 5.0.
As for https://stackoverflow.com/questions/34832679/is-the-callingconvention-ignored-in-64-bit-net-applications , does it apply here? It's talking about `DllImport`, which might be different from this?
In any case, keeping the default convention seems more theoretically sound (and forward-looking, perhaps).
cc @westonpace @lidavidm for opinions.
If I understand correctly, are you saying to change this:
public delegate* unmanaged[Stdcall]<CArrowArray*, void> release;
into this?
#if NET5_0_OR_GREATER
public delegate* unmanaged<CArrowArray*, void> release;
#else
public delegate* unmanaged[Stdcall]<CArrowArray*, void> release;
#endif
That would cause an incompatible API surface between the assembly compiled for .NET 6 and that compiled for the earlier frameworks. We have two options:
- Lie and keep the `stdcall` calling convention on the function pointers.
- Use the default unmanaged calling convention but support the C interface only on .NET 6+ (we don't target 5 as it is unsupported).
Well, why is this `release` member public here? It will be exposed to C Data Interface consumers as the `release` pointer, but needn't (and probably shouldn't) be part of the Arrow C# API.
Arrow C# API users should only see the high-level import and export methods such as `ImportArray` and `ExportArray`.
(an important thing to understand is that the C Data Interface is a binary interface, not an API)
If you take a look for example at the Arrow C++ implementation, its release functions are entirely private. For example `ReleaseExportedSchema` below is in the anonymous namespace, which doesn't expose the function publicly:
arrow/cpp/src/arrow/c/bridge.cc, line 107 in e628ca5:
void ReleaseExportedSchema(struct ArrowSchema* schema) {
So I should make the members of these structs private? That also works.
If there were some reason for the fields to stay public, this one could also probably be defined as a union between a public IntPtr and a private delegate. (I don't know why the fields would need to be public; I'm pretty sure I just followed the pattern that was present for schemas.)
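The "union" idea in the comment above could be sketched roughly as follows. `ReleaseSlot` and its field names are hypothetical and are not the actual `CArrowArray` layout. The function pointer field is `internal` rather than `private` here only so the demo class can read it.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical struct for illustration; not the real Apache.Arrow.C layout.
[StructLayout(LayoutKind.Explicit)]
public unsafe struct ReleaseSlot
{
    // Public, convention-agnostic view of the callback slot.
    [FieldOffset(0)] public IntPtr ReleasePtr;

    // Typed view of the same bytes, hidden from the public API surface.
    [FieldOffset(0)] internal delegate* unmanaged[Stdcall]<ReleaseSlot*, void> ReleaseFn;
}

public static unsafe class UnionDemo
{
    public static void Main()
    {
        ReleaseSlot slot = default;
        slot.ReleasePtr = new IntPtr(0x1234); // dummy value, never invoked

        // Both fields alias the same storage, so they always agree.
        Console.WriteLine((IntPtr)(void*)slot.ReleaseFn == slot.ReleasePtr); // True
        Console.WriteLine(sizeof(ReleaseSlot) == IntPtr.Size);               // True
    }
}
```

Because the two fields overlap at offset 0, the struct keeps the pointer-sized slot the C ABI expects, while only the `IntPtr` is visible to callers outside the assembly.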
@@ -203,6 +207,10 @@ private unsafe static void ReleaseArray(CArrowArray* cArray)
private unsafe static void Dispose(void** ptr)
{
    GCHandle gch = GCHandle.FromIntPtr((IntPtr)(*ptr));
    if (!gch.IsAllocated)
This is effectively a noop. If the pointer was null, the previous line will throw InvalidOperationException and if it wasn't null then IsAllocated will return true.
The overall change here also means that calling ReleaseArray twice will now throw an exception instead of the second call being a no-op.
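One way to keep a second release call a no-op, as a minimal sketch: test the stored pointer before converting it to a `GCHandle`, and clear it after freeing. `TryFree` is a hypothetical helper written for this illustration, not the method in the PR.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class ReleaseSketch
{
    // Hypothetical helper: frees the handle stored at *ptr at most once.
    // Returns true if a handle was freed, false if there was nothing to free.
    public static bool TryFree(void** ptr)
    {
        // Check the raw pointer first, so a null slot never reaches FromIntPtr.
        if (ptr == null || *ptr == null)
        {
            return false;
        }

        GCHandle gch = GCHandle.FromIntPtr((IntPtr)(*ptr));
        gch.Free();

        // Clear the slot so a repeated call becomes a no-op.
        *ptr = null;
        return true;
    }

    public static void Main()
    {
        GCHandle handle = GCHandle.Alloc(new object());
        void* raw = (void*)GCHandle.ToIntPtr(handle);

        Console.WriteLine(TryFree(&raw)); // True
        Console.WriteLine(TryFree(&raw)); // False
    }
}
```

The null check on the raw pointer does the work that the `IsAllocated` test cannot do once `FromIntPtr` has already been called.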
public static void Free(void** ptr)
{
    GCHandle gch = GCHandle.FromIntPtr((IntPtr)ptr);
    if (!gch.IsAllocated)
(Same comment about IsAllocated.)
@@ -184,13 +189,12 @@ private unsafe static void ConvertRecordBatch(ExportedAllocationOwner sharedOwne
    cArray->dictionary = null;
}

#if NET5_0_OR_GREATER
[UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvStdcall) })]
It sounds like we might've worked through most of the issues. If I understand it correctly, if someone wants to use 32-bit windows, then they need to use .NET 5 or greater, which seems like a reasonable tradeoff to me.
Given we don't have any 32-bit tests for any of the C# APIs, there may be other issues as well. We can maybe just move forward and worry about 32-bit APIs if a user emerges?
I'd understood it differently (though perhaps incorrectly). I thought that using the explicit [Stdcall] would work everywhere except 32-bit non-Windows platforms, while removing the [Stdcall] would prevent things from working on .NET < 5.
Yes, looking at the code now I think you're right. I interpreted the above conversation as:
However, looking at the code, I don't see that change. Are we still waiting on that change? Or is there some reason this wouldn't work?
I will make the fields
I made them internal.
I thought the point of making them
However, I don't see that. Also, looks like we will need a rebase.
Done @westonpace.
@teo-tsirpanis @westonpace Is there anything left to do here? It would be nice to have this in 13.0.0.
I will say again that I think the callbacks should be See: https://www.codeproject.com/Articles/1388/Calling-Conventions-Demystified
(and, yes, it would only make a difference on 32-bit x86 Windows machines, and even then, perhaps in some cases the two are equivalent)
I'm still not entirely an expert in calling conventions, and I don't fully understand. I do think the default should be cdecl. However, given this only affects extremely rare systems, and we don't even know if there are any users using those systems, I don't want to hold up legitimate fixes while worrying over calling conventions.
Feel free to merge :-)
Sorry, didn't see your comment on #36506 but it looks like you got the change in just in time. Fortunately, it seems CI is passing :) I'll keep monitoring it.
All tests passed, thanks for your attention and the cleanup @teo-tsirpanis
@@ -35,15 +35,23 @@ public unsafe struct CArrowArrayStream
///
/// Return value: 0 if successful, an `errno`-compatible error code otherwise.
///</summary>
public delegate* unmanaged[Stdcall]<CArrowArrayStream*, CArrowSchema*, int> get_schema;
If we make these function pointers `internal`, how are users of these APIs supposed to call them?
They should not; the importer will take care of calling them.
I don't fully follow. If I wanted to take some C arrow data (let's say coming from C++) and natively work with it - not using the "normal" C# ArrowTypes, Schema, RecordBatches, etc. but instead manually calling into the native functions on the CArrowArrayStream, that is no longer possible? Why not?
The reason for interoping at the native layer would be for performance - I wouldn't need to allocate a bunch of managed objects just to interact with the Arrow information coming from some other library.
See #35810 (comment) for an explanation on why the function pointer fields are internal.
Another option would be to only publicly expose the function pointer fields on `net5.0+`. Keep them `internal` for netstandard and netfx where they need to be declared `Cdecl` (and honestly function pointers aren't really meant to be used on netfx and netstandard since they came in C# 9 - https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/configure-language-version).
That way users on net5+ can still call the function pointers, if they need to. For netstandard and netfx, they are no worse off - they can't call them with this change anyway.
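A rough sketch of what that suggestion might look like follows. `StreamSketch`, `get_next`, and `Next` are hypothetical names, and the struct is not the real `CArrowArrayStream`. The field is public with the platform-default convention on .NET 5+, and internal with an explicit `Cdecl` elsewhere.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical struct for illustration only.
public unsafe struct StreamSketch
{
#if NET5_0_OR_GREATER
    // Callable by library users; uses the platform default convention.
    public delegate* unmanaged<StreamSketch*, int> get_next;
#else
    // Hidden on older targets, where an explicit convention must be spelled out.
    internal delegate* unmanaged[Cdecl]<StreamSketch*, int> get_next;
#endif
}

public static unsafe class VisibilityDemo
{
#if NET5_0_OR_GREATER
    // Stand-in for a producer's callback; always reports success.
    [UnmanagedCallersOnly]
    private static int Next(StreamSketch* stream) => 0;
#endif

    public static void Main()
    {
#if NET5_0_OR_GREATER
        StreamSketch stream = default;
        stream.get_next = &Next;

        // A consumer on .NET 5+ can invoke the pointer directly.
        int rc = stream.get_next(&stream);
        Console.WriteLine(rc); // 0
#endif
    }
}
```

The trade-off raised earlier in the thread still applies: the public surface of the assembly then differs between target frameworks.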
Conbench analyzed the 6 benchmark runs on this commit. There were 2 benchmark results indicating a performance regression. The full Conbench report has more details.
Rationale for this change
This PR fixes issues identified while reading the code of the `Apache.Arrow.C` namespace.
What changes are included in this PR?
See each commit message for more details.
Are these changes tested?
Using the existing test suite.
Are there any user-facing changes?
No.