Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35809: [C#] Improvements to the C Data Interface #35810

Merged
merged 15 commits into from
Jul 6, 2023

Conversation

teo-tsirpanis
Copy link
Contributor

@teo-tsirpanis teo-tsirpanis commented May 28, 2023

Rationale for this change

This PR fixes issues identified while reading the code of the Apache.Arrow.C namespace.

What changes are included in this PR?

See each commit message for more details.

Are these changes tested?

Using the existing test suite.

Are there any user-facing changes?

No.

@teo-tsirpanis teo-tsirpanis requested a review from westonpace as a code owner May 28, 2023 23:15
@github-actions
Copy link

@github-actions
Copy link

⚠️ GitHub issue #35809 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member

kou commented May 28, 2023

See each commit message for more details.

Could you write it in the pull request description because we use squash merge and the merge commit only uses the pull request description.

It seems that this pull request has several logical changes. It may be better that you split this pull request and associated issue to multiple small pull requests/issues for easy to review.

using System.Runtime.InteropServices;
using Apache.Arrow.Memory;

namespace Apache.Arrow.C
{
public static class CArrowArrayExporter
{
#if NET5_0_OR_GREATER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps add a comment explaining why this needs .Net 5.0+? Or will it be obvious to a C# developer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnmanagedCallersOnlyAttribute was introduced in .NET 5.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 30, 2023
@@ -203,6 +207,10 @@ private unsafe static void ReleaseArray(CArrowArray* cArray)
private unsafe static void Dispose(void** ptr)
{
GCHandle gch = GCHandle.FromIntPtr((IntPtr)(*ptr));
if (!gch.IsAllocated)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining when this might occur? (perhaps if an exception occurs while exporting the array?)

@@ -184,13 +189,12 @@ private unsafe static void ConvertRecordBatch(ExportedAllocationOwner sharedOwne
cArray->dictionary = null;
}

#if NET5_0_OR_GREATER
[UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvStdcall) })]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the default calling convention be used instead? I'm not sure stdcall is ok on non-Windows platforms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #34133 (comment). Ideally we would have used the default calling convention, but that would not be suppported on anything earlier than .NET 5. And in 64-bit platforms the calling convention doesn't matter either way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the entire point (mostly) of implementing the C Data Interface is to be compatible with non-.Net producers/consumers. Those are extremely likely to use the platform default. So we should get it right at least when possible, i.e. on .Net >= 5.0.

As for https://stackoverflow.com/questions/34832679/is-the-callingconvention-ignored-in-64-bit-net-applications , does it apply here? It's talking about DllImport, which might be different from this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, keeping the default convention seems more theoretically sound (and forward-looking, perhaps).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @westonpace @lidavidm for opinions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, are you saying to change this:

public delegate* unmanaged[Stdcall]<CArrowArray*, void> release;

into this?

#if NET5_0_OR_GREATER
public delegate* unmanaged<CArrowArray*, void> release;
#else
public delegate* unmanaged[Stdcall]<CArrowArray*, void> release;
#endif

That would cause an incompatible API surface between the assembly compiled for .NET 6 and that compiled for the earlier frameworks. We have two options:

  • Lie and keep the stdcall calling convention on the function pointers.
  • Use the default unmanaged calling convention but support the C interface only on .NET 6+ (we don't target 5 as it is unsupported).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, why is this release member public here? It will be exposed to C Data Interface consumers as the release pointer, but needn't (and probably shouldn't) be part of the Arrow C# API.

Arrow C# API users should only see the high-level import and export methods such as ImportArray and ExportArray.

(an important thing to understand is that the C Data Interface is a binary interface, not an API)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you take a look for example at the Arrow C++ implementation, its release functions are entirely private. For example ReleaseExportedSchema below is in the anonymous namespace, which doesn't expose the function publicly:

void ReleaseExportedSchema(struct ArrowSchema* schema) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I should make the members of these structs private? That also works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were some reason for the fields to stay public, this one could also probably be defined as a union between a public IntPtr and a private delegate. (I don't know why the fields would need to be public; I'm pretty sure I just followed the pattern that was present for schemas.)

@@ -203,6 +207,10 @@ private unsafe static void ReleaseArray(CArrowArray* cArray)
private unsafe static void Dispose(void** ptr)
{
GCHandle gch = GCHandle.FromIntPtr((IntPtr)(*ptr));
if (!gch.IsAllocated)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively a noop. If the pointer was null, the previous line will throw InvalidOperationException and if it wasn't null then IsAllocated will return true.

The overall change here also means that calling ReleaseArray twice will now throw an exception instead of the second call being a no-op.

public static void Free(void** ptr)
{
GCHandle gch = GCHandle.FromIntPtr((IntPtr)ptr);
if (!gch.IsAllocated)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same comment about IsAllocated.)

@@ -184,13 +189,12 @@ private unsafe static void ConvertRecordBatch(ExportedAllocationOwner sharedOwne
cArray->dictionary = null;
}

#if NET5_0_OR_GREATER
[UnmanagedCallersOnly(CallConvs = new[] { typeof(CallConvStdcall) })]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there were some reason for the fields to stay public, this one could also probably be defined as a union between a public IntPtr and a private delegate. (I don't know why the fields would need to be public; I'm pretty sure I just followed the pattern that was present for schemas.)

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we might've worked through most of the issues. If I understand it correctly, if someone wants to use 32-bit windows, then they need to use .NET 5 or greater, which seems like a reasonable tradeoff to me.

Given we don't have any 32-bit tests for any of the C# APIs, there may be other issues as well. We can maybe just move forward and worry about 32-bit APIs if a user emerges?

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting committer review Awaiting committer review labels Jun 8, 2023
@CurtHagenlocher
Copy link
Contributor

It sounds like we might've worked through most of the issues. If I understand it correctly, if someone wants to use 32-bit windows, then they need to use .NET 5 or greater, which seems like a reasonable tradeoff to me.

I'd understood it differently (though perhaps incorrectly). I thought that using the explicit [Stdcall] would work everywhere except 32-bit non-Windows platforms, while removing the [Stdcall] would prevent things from working on .NET < 5.

@westonpace
Copy link
Member

I'd understood it differently (though perhaps incorrectly). I thought that using the explicit [Stdcall] would work everywhere except 32-bit non-Windows platforms, while removing the [Stdcall] would prevent things from working on .NET < 5.

Yes, looking at the code now I think you're right. I interpreted the above conversation as:

  • This would be safe but lead to an unstable API surface
  • If we make these methods/fields private we don't have to worry about the API surface

However, looking at the code, I don't see that change. Are we still waiting on that change? Or is there some reason this wouldn't work?

@teo-tsirpanis
Copy link
Contributor Author

I will make the fields internal soon; I'm pretty busy these days.

@teo-tsirpanis
Copy link
Contributor Author

I made them internal.

@westonpace
Copy link
Member

I thought the point of making them internal was so that we could use something like this:

#if NET5_0_OR_GREATER
public delegate* unmanaged<CArrowArray*, void> release;
#else
public delegate* unmanaged[Stdcall]<CArrowArray*, void> release;
#endif

However, I don't see that. Also, looks like we will need a rebase.

@teo-tsirpanis
Copy link
Contributor Author

Done @westonpace.

@pitrou
Copy link
Member

pitrou commented Jul 5, 2023

@teo-tsirpanis @westonpace Is there anything left to do here? It would be nice to have this in 13.0.0.

@pitrou pitrou changed the title GH-35809: [C#] Improvements to the C interface. GH-35809: [C#] Improvements to the C Data Interface Jul 6, 2023
@pitrou
Copy link
Member

pitrou commented Jul 6, 2023

I will say again that I think the callbacks should be Cdecl, not Stdcall. Stdcall is the calling convention for WinAPI functions and nothing else, AFAICT. Cdecl is the default calling convention for any other C/C++ code, such as the callbacks that producers will pass in the release member.

See: https://www.codeproject.com/Articles/1388/Calling-Conventions-Demystified

@pitrou
Copy link
Member

pitrou commented Jul 6, 2023

(and, yes, it would only make a difference on 32-bit x86 Windows machines, and even then, perhaps in some cases the two are equivalent)

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not entirely an expert in calling conventions, and I don't fully understand. I do think the default should be cdecl. However, given this only affects extremely rare systems, and we don't even know if there are any users using those systems, I don't want to hold up legitimate fixes while worrying over calling conventions.

@pitrou
Copy link
Member

pitrou commented Jul 6, 2023

Feel free to merge :-)

@westonpace westonpace merged commit 88cb517 into apache:main Jul 6, 2023
@westonpace westonpace removed the awaiting merge Awaiting merge label Jul 6, 2023
@westonpace
Copy link
Member

Sorry, didn't see your comment on #36506 but it looks like you got the change in just in time. Fortunately, it seems CI is passing :) I'll keep monitoring it.

@westonpace
Copy link
Member

All tests passed, thanks for your attention and the cleanup @teo-tsirpanis

@@ -35,15 +35,23 @@ public unsafe struct CArrowArrayStream
///
/// Return value: 0 if successful, an `errno`-compatible error code otherwise.
///</summary>
public delegate* unmanaged[Stdcall]<CArrowArrayStream*, CArrowSchema*, int> get_schema;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make these function pointers internal, how are users of these APIs supposed to call them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should not; the importer will take care of calling them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully follow. If I wanted to take some C arrow data (let's say coming from C++) and natively work with it - not using the "normal" C# ArrowTypes, Schema, RecordBatches, etc. but instead manually calling into the native functions on the CArrowArrayStream, that is no longer possible? Why not?

The reason for interoping at the native layer would be for performance - I wouldn't need to allocate a bunch of managed objects just to interact with the Arrow information coming from some other library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #35810 (comment) for an explanation on why the function pointer fields are internal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to only publicly expose the function pointer fields on net5.0+. Keep them internal for netstandard and netfx where they need to be declared Cdecl (and honestly function pointers aren't really meant to be used on netfx and netstandard since they came in C# 9 - https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/configure-language-version).

That way users on net5+ can still call the function pointers, if they need to. For netstandard and netfx, they are no worse off - they can't call them with this change anyway.

@github-actions github-actions bot added the awaiting changes Awaiting changes label Jul 6, 2023
@conbench-apache-arrow
Copy link

Conbench analyzed the 6 benchmark runs on commit 88cb5170.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C#] Issues with the C data interface implementation.
6 participants