GH-33856: [C#] Implement C Data Interface for C# #35496

Merged: 14 commits, May 22, 2023

Contributor

@CurtHagenlocher CurtHagenlocher commented May 8, 2023

Rationale for this change

This continues implementing the C Data Interface for C# with integration for ArrowArray, RecordBatch and streams.

What changes are included in this PR?

  • Adds classes CArrowArray and CArrowArrayStream to represent the C API structures.
  • Adds interface IArrowArrayStream to represent an array stream or record batch reader.
  • Adds classes CArrowArrayImporter, CArrowArrayExporter, CArrowArrayStreamImporter and CArrowArrayStreamExporter to marshal between C# and C representations.
  • Augments the native memory representation to support (reasonably safe) ownership of memory by external code.

Are these changes tested?

Yes. Testing is largely done via the Python C API interface.

Are there any user-facing changes?

Yes, this adds new user-facing APIs to import and export C# structures using the C API.

This PR includes breaking changes to public APIs.

The default time unit for Time64Type was previously milliseconds. This does not appear to be valid, so it has been changed to nanoseconds.

@eerhardt eerhardt self-requested a review May 9, 2023 15:09
@wjones127 wjones127 self-requested a review May 9, 2023 16:57
Contributor

@eerhardt eerhardt left a comment

Thanks for the contribution, @CurtHagenlocher!

I got some time tonight to take a swipe through and posted some comments here. I haven't looked deep into the Memory changes yet. I will when I get time.

try
{
    ConvertArray(allocationOwner, array.Data, cArray);
    cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)Marshal.GetFunctionPointerForDelegate<ReleaseArrowArray>(ReleaseArray);

Contributor

I think we can do this another way. If you add [UnmanagedCallersOnly] to the private unsafe static void ReleaseArray(CArrowArray* cArray) method, then this line should just be:

Suggested change
- cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)Marshal.GetFunctionPointerForDelegate<ReleaseArrowArray>(ReleaseArray);
+ cArray->release = (delegate* unmanaged[Stdcall]<CArrowArray*, void>)&ReleaseArray;

Contributor

This applies for all the function pointers we need to set on these structs.

Contributor Author

@CurtHagenlocher CurtHagenlocher May 10, 2023

It would still work correctly if an array were exported via the C API to another bit of managed code which consumed it (via the C API)?

Edited: the bigger problem is that I would still like to support netstandard20, where UnmanagedCallersOnly isn't available. Is it worth using conditional compilation to optimize for .NET 5+?

Contributor

It would still work correctly if an array were exported via the C API to another bit of managed code which consumed it (via the C API)?

Yes, because the release pointer is an unmanaged function pointer, not a managed delegate. It is a bit hard to explain, but even though the caller is "managed" (i.e. .NET code), the method is invoked through an unmanaged function pointer, so it is still an "UnmanagedCaller".

Contributor

Is it worth using conditional compilation to optimize for .NET 5+?

Maybe try writing a benchmark test in https://github.com/apache/arrow/tree/main/csharp/test/Apache.Arrow.Benchmarks to compare the difference? If we see major differences, it may make sense to split the compilation.

Contributor

Actually, looking at this deeper, the current code may have problems since the managed Delegate might be GC'd.

See https://learn.microsoft.com/en-us/dotnet/api/system.runtime.interopservices.marshal.getfunctionpointerfordelegate?view=net-8.0

You must manually keep the delegate from being collected by the garbage collector from managed code. The garbage collector does not track references to unmanaged code.

Contributor Author

Yes, you'd think I'd have remembered this given that I pointed it out in a private email last month :/.

Contributor Author

I think I've taken care of the lifetime issues and would probably file a work item to test for optimization.

lock (this)
bool IOwnableAllocation.TryAcquire(out IntPtr ptr, out int offset, out int length)
{
// TODO: implement refcounted buffers?

Contributor

Will this get addressed in this PR? If not, we should open an issue for it.

@eerhardt
Contributor

Current test failure is:

Unhandled exception: System.NotSupportedException: JsonArrowType not supported: null
   at Apache.Arrow.IntegrationTest.IntegrationCommand.ToArrowType(JsonArrowType type) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 176
   at Apache.Arrow.IntegrationTest.IntegrationCommand.CreateField(Builder builder, JsonField jsonField) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 152
   at Apache.Arrow.IntegrationTest.IntegrationCommand.<>c__DisplayClass17_1.<CreateSchema>b__0(Builder f) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 145
   at Apache.Arrow.Schema.Builder.Field(Action`1 fieldBuilderAction) in /arrow/csharp/src/Apache.Arrow/Schema.Builder.cs:line 53
   at Apache.Arrow.IntegrationTest.IntegrationCommand.CreateSchema(JsonSchema jsonSchema) in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 145
   at Apache.Arrow.IntegrationTest.IntegrationCommand.Validate() in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 75
   at Apache.Arrow.IntegrationTest.IntegrationCommand.Execute() in /arrow/csharp/test/Apache.Arrow.IntegrationTest/IntegrationCommand.cs:line 58
   at Apache.Arrow.IntegrationTest.Program.<>c.<<Main>b__0_0>d.MoveNext() in /arrow/csharp/test/Apache.Arrow.IntegrationTest/Program.cs:line 49

Need to update:

private static IArrowType ToArrowType(JsonArrowType type)
{
    return type.Name switch
    {
        "bool" => BooleanType.Default,
        "int" => ToIntArrowType(type),
        "floatingpoint" => ToFloatingPointArrowType(type),
        "decimal" => ToDecimalArrowType(type),
        "binary" => BinaryType.Default,
        "utf8" => StringType.Default,
        "fixedsizebinary" => new FixedSizeBinaryType(type.ByteWidth),
        "date" => ToDateArrowType(type),
        "time" => ToTimeArrowType(type),
        "timestamp" => ToTimestampArrowType(type),
        _ => throw new NotSupportedException($"JsonArrowType not supported: {type.Name}")

Contributor

@davidhcoe davidhcoe left a comment

Reviewed for high level functionality and confirmed unit tests pass.

Member

@westonpace westonpace left a comment

A few minor nits but this looks very well thought out to me.

}
}

return new ValueTask<RecordBatch>(result);

Member

You could, potentially, accept a task scheduler and schedule a new task to call get_next, allowing this to be more accurately async. Though that should definitely be a follow-up and probably depends on what you're interfacing with (e.g. is the underlying stream performing I/O and slow?)

Contributor

@eerhardt eerhardt left a comment

Thanks, @CurtHagenlocher! This is great!

@eerhardt eerhardt merged commit 0dca449 into apache:main May 22, 2023
@CurtHagenlocher CurtHagenlocher deleted the CSharp_CAPI branch May 22, 2023 19:49
Comment on lines +52 to +55
if (cArray->release != null)
{
    throw new ArgumentException("Cannot export array to a struct that is already initialized.", nameof(cArray));
}

Member

I wouldn't mandate this, since the user can call this with a local uninitialized struct ArrowArray variable (if called from raw C rather than, say, Python).

Contributor Author

I was following the existing pattern in CArrowSchemaExporter. But I also think that the documentation, in saying that "A released structure is indicated by setting its release callback to NULL. Before reading and interpreting a structure’s data, consumers SHOULD check for a NULL release callback and treat it accordingly (probably by erroring out)." suggests that this check is appropriate.

Member

Then CArrowSchemaExporter should probably be modified as well.

This is producer code, not consumer code. At this point, the structure is still uninitialized. "Uninitialized" in the C (or C++) sense, that is "may contain any arbitrary bytes", not "zero-initialized".

Producer code therefore shouldn't care about what is already in the structure.

cc @paleolimbot
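
To make the producer/consumer distinction above concrete, here is a minimal C sketch. The struct layout follows the C Data Interface specification; `export_all_null_array` and `consume_array` are hypothetical names invented for this illustration, not functions from Arrow or from this PR. The producer overwrites every field of the output struct without reading any of it, so it behaves the same whether the caller passed zeroed memory or arbitrary bytes. Only the consumer checks `release` for NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as defined by the Arrow C Data Interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Release callback: nothing to free in this sketch, so it only marks
   the struct as released by setting release to NULL. */
static void release_all_null_array(struct ArrowArray* array) {
  array->release = NULL;
}

/* Hypothetical producer: exports an all-null array with no buffers.
   It never reads *out before writing it, so uninitialized input is fine. */
int export_all_null_array(struct ArrowArray* out, int64_t length) {
  assert(out != NULL);
  if (length < 0) return -1;

  struct ArrowArray fresh = {0};  /* build a fully defined value first */
  fresh.length = length;
  fresh.null_count = length;
  fresh.release = &release_all_null_array;

  *out = fresh;                   /* overwrite whatever bytes were there */
  return 0;
}

/* Hypothetical consumer: a NULL release callback means "already released",
   which is treated as an error before any other field is read. */
int consume_array(struct ArrowArray* in) {
  assert(in != NULL);
  if (in->release == NULL) return -1;
  /* ... read in->length, in->buffers, etc. here ... */
  in->release(in);
  return 0;
}
```

With this split, a C caller can pass the address of a local `struct ArrowArray` that was never initialized, and a second `consume_array` call on the same struct fails instead of touching released data.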

Contributor Author

This is part of why C is awful ;).

Contributor Author

Addressed as part of #35996.

Member

I'm sorry I missed this and I see that it's been solved. In nanoarrow we definitely assume that pointer output arguments point to uninitialized memory (and strive to not touch that memory until failure is impossible). There are a few places where we do something like

struct ArrowArray tmp;
tmp.release = NULL;
// stuff with tmp that might fail
if (had_error) {
  if (tmp.release != NULL) {
    tmp.release(&tmp);
  }

  return;
}

ArrowArrayMove(&tmp, out);
return;

...to simplify (to the extent that anything in C is simple) the error handling.
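
The temporary-struct pattern above can be written out as a self-contained sketch. Everything named here is hypothetical: `build_array`, its simulated failure modes, and `array_move`, which is a local stand-in for the behaviour the snippet relies on from `ArrowArrayMove` (shallow-copy the struct, then mark the source as released). The property it illustrates is that `out` is not written until failure is no longer possible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout as defined by the Arrow C Data Interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Nothing is heap-allocated in this sketch, so release only marks the
   struct as released. */
static void release_simple(struct ArrowArray* array) {
  array->release = NULL;
}

/* Local stand-in for a "move": shallow-copy src into dst, then mark src
   as released so that only dst owns the contents. */
void array_move(struct ArrowArray* src, struct ArrowArray* dst) {
  memcpy(dst, src, sizeof(*dst));
  src->release = NULL;
}

/* Builds into a temporary and moves into *out only on success.
   fail_late simulates an error after the temporary has been populated. */
int build_array(struct ArrowArray* out, int64_t length, int fail_late) {
  struct ArrowArray tmp;
  int status = 0;

  tmp.release = NULL;  /* the only field the error path inspects */

  if (length < 0) {
    status = EINVAL;   /* fails before tmp is populated */
  } else {
    memset(&tmp, 0, sizeof(tmp));
    tmp.length = length;
    tmp.null_count = length;
    tmp.release = &release_simple;
    if (fail_late) status = EIO;  /* fails after tmp is populated */
  }

  if (status != 0) {
    if (tmp.release != NULL) tmp.release(&tmp);
    return status;     /* *out has not been touched */
  }

  array_move(&tmp, out);
  return 0;
}
```

Because every error path releases `tmp` and returns before the move, the caller's struct keeps whatever bytes it had on failure and holds a fully initialized array on success.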

@ursabot

ursabot commented May 30, 2023

Benchmark runs are scheduled for baseline = 41ba4fe and contender = 0dca449. 0dca449 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.74% ⬆️0.3%] test-mac-arm
[Finished ⬇️0.33% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.27% ⬆️0.15%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 0dca449f ec2-t3-xlarge-us-east-2
[Finished] 0dca449f test-mac-arm
[Finished] 0dca449f ursa-i9-9960x
[Finished] 0dca449f ursa-thinkcentre-m75q
[Finished] 41ba4fe6 ec2-t3-xlarge-us-east-2
[Finished] 41ba4fe6 test-mac-arm
[Finished] 41ba4fe6 ursa-i9-9960x
[Finished] 41ba4fe6 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

}

RecordBatch result = null;
CArrowArray* cArray = CArrowArray.Create();

Contributor

Hi @CurtHagenlocher, thanks for implementing this! I've started testing using this to support reading Parquet data as Arrow record batches in ParquetSharp.

One concern I've found is that it appears that an imported stream will leak memory as these CArrowArray instances are allocated for each batch but they're never freed, as ImportedArrowArray.FinalRelease will call the release callback but never deallocates the CArrowArray struct itself.

Is that correct or am I missing something here?

Would it make sense for the imported array to take ownership of the CArrowArray struct and deallocate it after calling release?

This seems to be an issue for external use of the C Data Interface API too, eg. if I want to return an IArrowArrayStream from a method I can't free the CArrowArrayStream that was used to import it until after the user is finished with the stream, which is awkward.

Contributor Author

@CurtHagenlocher CurtHagenlocher Jun 8, 2023

Yes, the first issue has already been discovered and there's a PR out for fixing it: #35810

Give me a few minutes to absorb the second issue.

Derp... misread.

Contributor Author

Okay, my pattern matching was a bit too eager for the first problem. Yes, that looks like a leak and I think your suggestion is right; that on this code path the ImportedArrowArray needs to remember that it owns the allocation.

The second problem feels different, because in the first case we always know that we were the ones who allocated the CArrowArray but I'm not entirely sure we know that about the CArrowArrayStream. It's true for the stream we got from Python in the test case, but that was using a pyarrow-specific API. The C API as a whole does suggest that the caller will usually have to allocate the CArrowArrayStream, without (I think) quite making that explicit.

On the whole, I suspect that both importers should take a flag which says whether or not to deallocate the structure afterwards. I'm not convinced about the right default for the flag given the relative risks of leaking memory vs deallocating it inappropriately.
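
As a rough illustration of the flag idea, here is a C sketch of an importer wrapper that records whether it owns the struct allocation. All names (`imported_array`, `import_array`, `imported_array_dispose`, `make_demo_array`, `demo_release_calls`) are hypothetical and exist only for this example; this is not the C# API from this PR, and it assumes the struct was allocated with `malloc`/`calloc` whenever ownership is transferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Layout as defined by the Arrow C Data Interface. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

/* Hypothetical importer-side handle. owns_struct records whether the
   importer must free the struct itself after calling release. */
struct imported_array {
  struct ArrowArray* c_array;
  bool owns_struct;
};

struct imported_array import_array(struct ArrowArray* c_array, bool owns_struct) {
  assert(c_array != NULL);
  struct imported_array imported = { c_array, owns_struct };
  return imported;
}

/* Releases the contents, then frees the struct only if ownership was
   transferred. Safe to call more than once. */
void imported_array_dispose(struct imported_array* imported) {
  struct ArrowArray* c_array = imported->c_array;
  if (c_array == NULL) return;
  if (c_array->release != NULL) c_array->release(c_array);
  if (imported->owns_struct) free(c_array);
  imported->c_array = NULL;
}

/* Demo producer used to exercise the sketch: counts release calls. */
int demo_release_calls = 0;

void demo_release(struct ArrowArray* array) {
  demo_release_calls++;
  array->release = NULL;
}

struct ArrowArray* make_demo_array(void) {
  struct ArrowArray* array = calloc(1, sizeof(*array));
  if (array != NULL) array->release = &demo_release;
  return array;
}
```

With `owns_struct` set to false the caller keeps the allocation symmetric by freeing the struct itself after dispose; with true, the handle frees it, which avoids the per-batch leak described earlier in the thread but is only safe when the importer's allocator matches the one that created the struct.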

Contributor Author

@CurtHagenlocher CurtHagenlocher Jun 8, 2023

Filed #35988. Please amend it if I've misunderstood the problems.

Contributor

Right, yeah having a flag to say whether the structure should be deallocated would solve both problems if it was public and implemented for both the array and stream. I think keeping the current behaviour of not deallocating makes sense as a default, given a small memory leak is less of an issue than incorrectly deallocating, and then that keeps allocation and deallocation symmetric by default.

It wouldn't be a big deal if only the first problem was solved though, I can solve the second problem with some extra indirection by introducing a wrapper for the imported stream that deallocates the struct on dispose.

Contributor Author

Contributor

Thanks! That solution of moving the struct into the imported array or stream is much nicer

Successfully merging this pull request may close these issues.

[C#] Implement C Stream Interface in C# [C#] Implement C Data Interface for C#
8 participants