GH-33856: [C#] Implement C Data Interface for C# #35496

CurtHagenlocher · 2023-05-08T20:42:43Z

Rationale for this change

This continues implementing the C Data Interface for C# with integration for ArrowArray, RecordBatch and streams.

What changes are included in this PR?

Adds classes CArrowArray and CArrowStream to represent the C API structures.
Adds interface IArrowArrayStream to represent an array stream or record batch reader.
Adds classes CArrowArrayImporter, CArrowArrayExporter, CArrowArrayStreamImporter and CArrowArrayStreamExporter to marshal between C# and C representations.
Augments the native memory representation to support (reasonably safe) ownership of memory by external code.

Are these changes tested?

Yes. Testing is largely done via the Python C API interface.

Are there any user-facing changes?

Yes, this adds new user-facing APIs to import and export C# structures using the C API.

This PR includes breaking changes to public APIs.

The default time unit for Time64Type was previously milliseconds. This does not appear to be valid, so it has been changed to nanoseconds.
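As a rough illustration of why milliseconds is not a valid unit here: in the Arrow C Data Interface, each time format string fixes both the unit and the storage width, and only microseconds and nanoseconds are 64-bit. The sketch below is in C because that is the ABI under discussion. `time_format_bit_width` is a hypothetical helper, not part of any Arrow library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an Arrow API): returns the storage width in bits
 * implied by a C Data Interface time format string, or 0 if the string is
 * not one of the four time formats. */
int time_format_bit_width(const char* format) {
  assert(format != NULL);
  if (strcmp(format, "tts") == 0) return 32; /* time32 [seconds] */
  if (strcmp(format, "ttm") == 0) return 32; /* time32 [milliseconds] */
  if (strcmp(format, "ttu") == 0) return 64; /* time64 [microseconds] */
  if (strcmp(format, "ttn") == 0) return 64; /* time64 [nanoseconds] */
  return 0;
}
```

A 64-bit time type with a millisecond unit has no format string at all, so it could not be exported through the C API.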

Closes: [C#] Implement C Data Interface for C# #33856
Closes: [C#] Implement C Stream Interface in C# #33857

github-actions · 2023-05-08T20:43:03Z

Closes: [C#] Implement C Data Interface for C# #33856

csharp/src/Apache.Arrow/Ipc/IArrowArrayStream.cs

csharp/src/Apache.Arrow/C/CArrowArrayStreamExporter.cs

eerhardt

Thanks for the contribution, @CurtHagenlocher!

I got some time tonight to take a swipe through and posted some comments here. I haven't looked deep into the Memory changes yet. I will when I get time.

eerhardt · 2023-05-10T02:19:44Z

csharp/src/Apache.Arrow/C/CArrowArrayExporter.cs

+            try
+            {
+                ConvertArray(allocationOwner, array.Data, cArray);
+                cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)Marshal.GetFunctionPointerForDelegate<ReleaseArrowArray>(ReleaseArray);


I think we can do this another way. If you add [UnmanagedCallersOnly] to the private unsafe static void ReleaseArray(CArrowArray* cArray) method, then this line should just be:

Suggested change

cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)Marshal.GetFunctionPointerForDelegate<ReleaseArrowArray>(ReleaseArray);

cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)&ReleaseArray;

This applies for all the function pointers we need to set on these structs.

It would still work correctly if an array were exported via the C API to another bit of managed code which consumed it (via the C API)?

Edited: the bigger problem is that I would still like to support netstandard20, where UnmanagedCallersOnly isn't available. Is it worth using conditional compilation to optimize for .NET 5+?

It would still work correctly if an array were exported via the C API to another bit of managed code which consumed it (via the C API)?

Yes, because the release pointer is an unmanaged function pointer, not a managed delegate. It is a bit hard to explain, but even though the caller is "managed" (i.e. .NET code), the method is invoked through an unmanaged function pointer. So it is still an "UnmanagedCaller".

Is it worth using conditional compilation to optimize for .NET 5+?

Maybe try writing a benchmark test in https://github.com/apache/arrow/tree/main/csharp/test/Apache.Arrow.Benchmarks to compare the difference? If we see major differences, it may make sense to split the compilation.

Actually, looking at this deeper, the current code may have problems since the managed Delegate might be GC'd.

See https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.marshal.getfunctionpointerfordelegate?view=net-8.0

You must manually keep the delegate from being collected by the garbage collector from managed code. The garbage collector does not track references to unmanaged code.

Yes, you'd think I'd have remembered this given that I pointed it out in a private email last month :/.

I think I've taken care of the lifetime issues and would probably file a work item to test for optimization.

csharp/src/Apache.Arrow/C/CArrowArrayExporter.cs

csharp/src/Apache.Arrow/Memory/ExportedAllocationOwner.cs

csharp/src/Apache.Arrow/Memory/NativeMemoryManager.cs

eerhardt · 2023-05-10T02:42:06Z

csharp/src/Apache.Arrow/Memory/NativeMemoryManager.cs

-            lock (this)
+        bool IOwnableAllocation.TryAcquire(out IntPtr ptr, out int offset, out int length)
+        {
+            // TODO: implement refcounted buffers?


Will this get addressed in this PR? If not, we should open an issue for it.

csharp/test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs

csharp/src/Apache.Arrow/Arrays/ArrowArrayFactory.cs

csharp/src/Apache.Arrow/Arrays/NullArray.cs

eerhardt · 2023-05-10T16:47:03Z

Current test failure is:

Unhandled exception: System.NotSupportedException: JsonArrowType not supported: null
   at Apache.Arrow.IntegrationTest.IntegrationCommand.ToArrowType(JsonArrowType type) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 176
   at Apache.Arrow.IntegrationTest.IntegrationCommand.CreateField(Builder builder, JsonField jsonField) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 152
   at Apache.Arrow.IntegrationTest.IntegrationCommand.<>c__DisplayClass17_1.<CreateSchema>b__0(Builder f) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 145
   at Apache.Arrow.Schema.Builder.Field(Action`1 fieldBuilderAction) in /arrow/csharp/src/Apache.Arrow/Schema.Builder.cs:line 53
   at Apache.Arrow.IntegrationTest.IntegrationCommand.CreateSchema(JsonSchema jsonSchema) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 145
   at Apache.Arrow.IntegrationTest.IntegrationCommand.Validate() in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 75
   at Apache.Arrow.IntegrationTest.IntegrationCommand.Execute() in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 58
   at Apache.Arrow.IntegrationTest.Program.<>c.<<Main>b__0_0>d.MoveNext() in /arrow/csharp/test/Apache.Arrow.IntegrationTest/Program.cs:line 49

Need to update:

arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs

Lines 162 to 176 in 3948c42

    
        private static IArrowType ToArrowType(JsonArrowType type)
        {
            return type.Name switch
            {
                "bool" => BooleanType.Default,
                "int" => ToIntArrowType(type),
                "floatingpoint" => ToFloatingPointArrowType(type),
                "decimal" => ToDecimalArrowType(type),
                "binary" => BinaryType.Default,
                "utf8" => StringType.Default,
                "fixedsizebinary" => new FixedSizeBinaryType(type.ByteWidth),
                "date" => ToDateArrowType(type),
                "time" => ToTimeArrowType(type),
                "timestamp" => ToTimestampArrowType(type),
                _ => throw new NotSupportedException($"JsonArrowType not supported: {type.Name}")

csharp/test/Apache.Arrow.Tests/TestData.cs

csharp/test/Apache.Arrow.Tests/NullArrayTests.cs

csharp/test/Apache.Arrow.Tests/CDataInterfacePythonTests.cs

csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs

csharp/src/Apache.Arrow/C/CArrowArray.cs

csharp/src/Apache.Arrow/C/NativeDelegate.cs

davidhcoe

Reviewed for high level functionality and confirmed unit tests pass.

westonpace

A few minor nits but this looks very well thought out to me.

csharp/src/Apache.Arrow/Arrays/NullArray.cs

csharp/src/Apache.Arrow/Memory/ExportedAllocationOwner.cs

csharp/src/Apache.Arrow/C/CArrowArrayImporter.cs

csharp/src/Apache.Arrow/Memory/ImportedAllocationOwner.cs

csharp/src/Apache.Arrow/C/CArrowArrayStreamImporter.cs

westonpace · 2023-05-17T16:44:17Z

csharp/src/Apache.Arrow/C/CArrowArrayStreamImporter.cs

+                    }
+                }
+
+                return new ValueTask<RecordBatch>(result);


You could, potentially, accept a task scheduler and schedule a new task to call get_next, allowing this to be more accurately async. Though that should definitely be a follow-up and probably depends on what you're interfacing with (e.g. is the underlying stream performing I/O and slow?)

eerhardt

Thanks, @CurtHagenlocher! This is great!

pitrou · 2023-05-23T16:17:41Z

csharp/src/Apache.Arrow/C/CArrowArrayExporter.cs

+            if (cArray->release != null)
+            {
+                throw new ArgumentException("Cannot export array to a struct that is already initialized.", nameof(cArray));
+            }


I wouldn't mandate this, since the user can call this with a local uninitialized struct ArrowArray variable (if called from raw C rather than, say, Python).

I was following the existing pattern in CArrowSchemaExporter. But I also think that the documentation, in saying that "A released structure is indicated by setting its release callback to NULL. Before reading and interpreting a structure’s data, consumers SHOULD check for a NULL release callback and treat it accordingly (probably by erroring out)." suggests that this check is appropriate.

Then CArrowSchemaExporter should probably be modified as well.

This is producer code, not consumer code. At this point, the structure is still uninitialized. "Uninitialized" in the C (or C++) sense, that is "may contain any arbitrary bytes", not "zero-initialized".

Producer code therefore shouldn't care about what is already in the structure.
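To make the producer-side point concrete, here is a minimal sketch in C of an export function that only ever writes to the output struct. The struct layout is the one from the C Data Interface specification. `export_int32` and `release_int32` are hypothetical names made up for this illustration, not code from this PR.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Layout from the Arrow C Data Interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Frees what export_int32 allocated, then marks the struct as released. */
void release_int32(struct ArrowArray* array) {
  free((void*)array->buffers[1]); /* values buffer */
  free((void*)array->buffers);    /* table of buffer pointers */
  array->release = NULL;
}

/* Hypothetical producer: copies n int32 values into a newly exported array.
 * Returns 0 on success and -1 on allocation failure. */
int export_int32(const int32_t* values, int64_t n, struct ArrowArray* out) {
  assert(out != NULL && n >= 0);
  const void** buffers = malloc(2 * sizeof *buffers);
  int32_t* data = malloc((n > 0 ? (size_t)n : 1) * sizeof *data);
  if (buffers == NULL || data == NULL) {
    free((void*)buffers);
    free(data);
    return -1; /* *out has not been touched */
  }
  if (n > 0) memcpy(data, values, (size_t)n * sizeof *data);
  buffers[0] = NULL; /* no validity bitmap, since null_count is 0 */
  buffers[1] = data;
  /* Plain stores only: whatever bytes *out held before are overwritten
   * without being read, so a stale non-NULL release is harmless. */
  out->length = n;
  out->null_count = 0;
  out->offset = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffers;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = release_int32;
  out->private_data = NULL;
  return 0;
}
```

A caller in raw C can pass the address of an uninitialized local `struct ArrowArray`. A check like `out->release != NULL` in the producer would then be reading indeterminate memory.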

cc @paleolimbot

This is part of why C is awful ;).

Addressed as part of #35996.

I'm sorry I missed this and I see that it's been solved. In nanoarrow we definitely assume that pointer output arguments point to uninitialized memory (and strive to not touch that memory until failure is impossible). There are a few places where we do something like

    struct ArrowArray tmp;
    tmp.release = NULL;
    // stuff with tmp that might fail
    if (had_error) {
      if (tmp.release != NULL) {
        tmp.release(&tmp);
      }
      return;
    }
    ArrowArrayMove(&tmp, out);
    return;

...to simplify (to the extent that anything in C is simple) the error handling.
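For readers who want to try the pattern, the following is a self-contained sketch of it in C. It is not nanoarrow code: `array_move` is a local stand-in that assumes a move is a byte copy followed by marking the source as released, and `build_empty`, `release_empty` and `make_array` are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout from the Arrow C Data Interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Stand-in for a move helper (assumption: byte copy, then mark the
 * source as released so it cannot be released twice). */
void array_move(struct ArrowArray* src, struct ArrowArray* dst) {
  memcpy(dst, src, sizeof *dst);
  src->release = NULL;
}

/* Nothing is heap-allocated in this sketch, so release only marks. */
void release_empty(struct ArrowArray* array) { array->release = NULL; }

/* Hypothetical build step: produces a valid zero-length array. */
int build_empty(struct ArrowArray* array) {
  memset(array, 0, sizeof *array);
  array->release = release_empty;
  return 0;
}

/* Builds into a local temporary and writes *out only once failure is no
 * longer possible. fail_late simulates an error after tmp was built. */
int make_array(int fail_late, struct ArrowArray* out) {
  assert(out != NULL);
  struct ArrowArray tmp;
  tmp.release = NULL;
  int had_error = build_empty(&tmp) != 0 || fail_late;
  if (had_error) {
    if (tmp.release != NULL) {
      tmp.release(&tmp); /* clean up the partial result */
    }
    return -1; /* *out is untouched */
  }
  array_move(&tmp, out);
  return 0;
}
```

On the failure path the caller's struct keeps whatever bytes it had, so the callee never has to reason about partially written output.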

ursabot · 2023-05-30T21:59:45Z

Benchmark runs are scheduled for baseline = 41ba4fe and contender = 0dca449. 0dca449 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.74% ⬆️0.3%] test-mac-arm
[Finished ⬇️0.33% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.27% ⬆️0.15%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0dca449f ec2-t3-xlarge-us-east-2
[Finished] 0dca449f test-mac-arm
[Finished] 0dca449f ursa-i9-9960x
[Finished] 0dca449f ursa-thinkcentre-m75q
[Finished] 41ba4fe6 ec2-t3-xlarge-us-east-2
[Finished] 41ba4fe6 test-mac-arm
[Finished] 41ba4fe6 ursa-i9-9960x
[Finished] 41ba4fe6 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

adamreeve · 2023-06-08T02:21:48Z

csharp/src/Apache.Arrow/C/CArrowArrayStreamImporter.cs

+                }
+
+                RecordBatch result = null;
+                CArrowArray* cArray = CArrowArray.Create();


Hi @CurtHagenlocher, thanks for implementing this! I've started testing using this to support reading Parquet data as Arrow record batches in ParquetSharp.

One concern I've found is that it appears that an imported stream will leak memory as these CArrowArray instances are allocated for each batch but they're never freed, as ImportedArrowArray.FinalRelease will call the release callback but never deallocates the CArrowArray struct itself.

Is that correct or am I missing something here?

Would it make sense for the imported array to take ownership of the CArrowArray struct and deallocate it after calling release?

This seems to be an issue for external use of the C Data Interface API too, eg. if I want to return an IArrowArrayStream from a method I can't free the CArrowArrayStream that was used to import it until after the user is finished with the stream, which is awkward.

Yes, the first issue has already been discovered and there's a PR out for fixing it: #35810

Give me a few minutes to absorb the second issue.

Derp... misread.

Okay, my pattern matching was a bit too eager for the first problem. Yes, that looks like a leak and I think your suggestion is right; that on this code path the ImportedArrowArray needs to remember that it owns the allocation.

The second problem feels different, because in the first case we always know that we were the ones who allocated the CArrowArray but I'm not entirely sure we know that about the CArrowArrayStream. It's true for the stream we got from Python in the test case, but that was using a pyarrow-specific API. The flavor of the C API as a whole does suggest that it will usually be the case for the caller to have to allocate the CArrowArrayStream without (I think) quite making it explicit.

On the whole, I suspect that both importers should take a flag which says whether or not to deallocate the structure afterwards. I'm not convinced about the right default for the flag given the relative risks of leaking memory vs deallocating it inappropriately.

Filed #35988. Please amend it if I've misunderstood the problems.

Right, yeah having a flag to say whether the structure should be deallocated would solve both problems if it was public and implemented for both the array and stream. I think keeping the current behaviour of not deallocating makes sense as a default, given a small memory leak is less of an issue than incorrectly deallocating, and then that keeps allocation and deallocation symmetric by default.

It wouldn't be a big deal if only the first problem was solved though, I can solve the second problem with some extra indirection by introducing a wrapper for the imported stream that deallocates the struct on dispose.
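The flag idea discussed above can be sketched at the C level as follows. This is only an illustration of the ownership question, not the C# API that was eventually implemented. `dispose_imported` and `mark_released` are hypothetical names, and the sketch assumes the struct was allocated with `malloc`/`calloc` whenever `free_struct` is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout from the Arrow C Data Interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Demo release callback: owns nothing, only marks the struct as released. */
void mark_released(struct ArrowArray* array) { array->release = NULL; }

/* Hypothetical consumer-side cleanup. The release callback frees the data
 * the producer owns; it never frees the struct itself. The flag says
 * whether this consumer also owns the struct's own allocation. */
void dispose_imported(struct ArrowArray* array, bool free_struct) {
  assert(array != NULL);
  if (array->release != NULL) {
    array->release(array); /* skipped if already released */
  }
  if (free_struct) {
    free(array); /* only valid for heap-allocated structs */
  }
}
```

Passing `false` mirrors the current behaviour of not deallocating, which can leak the small struct but can never free memory the caller did not allocate.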

Thanks! That solution of moving the struct into the imported array or stream is much nicer.

CurtHagenlocher added 5 commits May 6, 2023 20:29

Work in progress

1b929bc

Mostly working, but without correct memory management or a full complement of supported types.

5f3013c

Small changes

a0dc92b

Implemented complete (if limited) memory allocation strategy.

366225b

Cleanup

c679e2e

CurtHagenlocher requested a review from westonpace as a code owner May 8, 2023 20:42

github-actions bot added Component: C# and awaiting review labels May 8, 2023

CurtHagenlocher commented May 8, 2023

csharp/src/Apache.Arrow/Ipc/IArrowArrayStream.cs

github-actions bot added awaiting committer review and removed awaiting review labels May 8, 2023

CurtHagenlocher commented May 8, 2023

csharp/src/Apache.Arrow/C/CArrowArrayStreamExporter.cs

eerhardt self-requested a review May 9, 2023 15:09

wjones127 self-requested a review May 9, 2023 16:57

eerhardt reviewed May 10, 2023

github-actions bot added awaiting changes and removed awaiting committer review labels May 10, 2023

Improvements as per pull request

67c5044

CurtHagenlocher requested review from assignUser, kou and raulcd as code owners May 10, 2023 04:09

github-actions bot added Component: Documentation and awaiting change review and removed awaiting changes labels May 10, 2023

Added NullArray support for real, this time with tests.

f740b36

CurtHagenlocher commented May 10, 2023

csharp/src/Apache.Arrow/Arrays/NullArray.cs (outdated)

github-actions bot added awaiting changes and removed awaiting change review labels May 10, 2023

github-actions bot added the awaiting change review label May 10, 2023

CurtHagenlocher commented May 10, 2023

csharp/test/Apache.Arrow.Tests/TestData.cs (outdated)

CurtHagenlocher added 2 commits May 11, 2023 00:09

Fix null integration test

bfcab3a

Fix Null type serialization to flatbuffers.

4eb9c1f

github-actions bot added awaiting changes and removed awaiting change review labels May 15, 2023

eerhardt reviewed May 15, 2023

Make changes suggested by code review

edc74af

github-actions bot added awaiting change review and removed awaiting changes labels May 15, 2023

davidhcoe reviewed May 17, 2023

github-actions bot added awaiting changes and removed awaiting change review labels May 17, 2023

CurtHagenlocher added 2 commits May 17, 2023 16:19

Changed the default Time64Type to have a valid time unit.

ad7907b

Merge branch 'CSharp_CAPI' of https://github.com/CurtHagenlocher/arrow into CSharp_CAPI

8541069

github-actions bot added awaiting change review and removed awaiting changes labels May 17, 2023

westonpace approved these changes May 17, 2023

github-actions bot added awaiting merge and removed awaiting change review labels May 17, 2023

Changes suggested by code review.

f816507

davidhcoe mentioned this pull request May 22, 2023

feat(csharp): adding C# functionality apache/arrow-adbc#697

Merged

eerhardt approved these changes May 22, 2023

eerhardt merged commit 0dca449 into apache:main May 22, 2023

CurtHagenlocher deleted the CSharp_CAPI branch May 22, 2023 19:49

pitrou reviewed May 23, 2023

adamreeve reviewed Jun 8, 2023

github-actions bot added awaiting changes and removed awaiting merge labels Jun 18, 2023
