-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37938: [Swift] initial impl of C Data interface #39091
Conversation
This is a nice start. At some point, the Swift implementation will have to participate in integration testing, but in the meantime you could exercise this by testing against PyArrow. Two approaches are possible: |
Thank you! This is good to know. I will look into the C# impl and add tests from Swift to PyArrow. Thank you! |
Added local integration/sanity testing for the c data interface between go and swift. This test is currently only able to run on macox due to an issue with the linker in cgo being unable to load swift lib. Due to this it is not enabled during the docker swift check. |
4361165
to
0eae75a
Compare
let length = UInt(cArray.length) | ||
let nullCount = UInt(cArray.null_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: why UInt
? It looks like this makes you convert back-and-forth between signed and unsigned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The swift implementation was started with UInt for null count and this needs to be changed across the board. I am planning a follow up fix to allow a UInt and a non positive value to be handled by the code.
let pointer = allocator.allocateArray(byteCount) | ||
pointer.copyMemory(from: cBuffer!, byteCount: byteCount) | ||
arrowBuffers.append(ArrowBuffer(length: length, capacity: UInt(byteCount), rawPointer: pointer)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I misunderstanding or is this actually copying the imported data buffers? The essence of the C Data Interface is to be able to share data zero-copy.
Ideally you would be able to create Swift buffers pointing to externally-allocated memory, with some kind of handle that keeps the exported ArrowC
array alive until all Swift buffers pointing to it have been released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, you are correct, this is copying over the data. I will update accordingly.
b64e0a1
to
4af0150
Compare
@pitrou I have made the requested changes. Please review again when you get a chance. |
public func exportType(_ cSchema: inout ArrowC.ArrowSchema, arrowType: ArrowType, name: String = "") -> | ||
Result<Bool, ArrowError> { | ||
do { | ||
cSchema.format = try arrowType.cDataFormatId.cstring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a static string for now? At some point you'll need to generate the format string dynamically for parametric data types (see https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value is not static. It is an instance method on the data type so based on the type we can returned dynamic strings if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my point: you need this value to have its lifetime controlled by the release
callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we need to store and then release this data but this comment was in regards to the dynamic nature of the strings (at least that is how I understood this question)
byteCount: Int, | ||
length: UInt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why those two parameters? The C Data Interface does not give you any distinct length/capacity information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are required by the ArrowBuffer. Length is coming from ArrowArray.length and byteCount is determined from the data. These values are required by the ArrowBuffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that the ArrowBuffer constructor takes them. They are just not required as separate arguments for this function :-) It could take a single length argument and pass it for both ArrowBuffer's length and capacity.
(also I don't understand why you are mixing signed and unsigned here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The length and capacity are different right? Length is the amount of elements that are in the array and capacity is the amount of bytes for the buffer so theses are not the same
(the sign issue does need to be resolved. I plan to resolve in a follow up PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are different in general, but when ingesting data from the C Data Interface the only information you can compute is the length. You have to make up a dummy capacity value, and using the length is as good a choice as any :-)
(also, the capacity should only be important for growing or shrinking the buffer, which you hopefully won't attempt on a buffer allocated by another runtime :-))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, I will update to use only the length. Thank you!
arrowBuffers: inout [ArrowBuffer], | ||
byteCount: Int, | ||
length: UInt, | ||
nullCount: UInt? = nil) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this parameter doing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
let type = try ArrowType.fromCDataFormatId(cArrow) | ||
return .success(ArrowField(name, type: ArrowType(type.info), isNullable: true)) | ||
} catch { | ||
return .failure(.invalid("\(error)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you care making the error message a bit more detailed?
switch importType( | ||
String(cString: cSchema.format), name: String(cString: cSchema.name)) { | ||
case .success(let field): | ||
release(cSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but is the C schema not released in the failures above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not, I didn't want to perform a release as the developer might be able to handle error and try again. Also, if a failure happens as a developer I would still want my original data to exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Perhaps you can add docstrings to make clear each method's behavior?
|
||
public func importArray( | ||
_ cArray: ArrowC.ArrowArray, | ||
arrowField: ArrowField) -> Result<ArrowArrayHolder, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one does not seem to release the C array on error, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not. I had chatted with @zeroshade about the golang behavior and it doesn't auto release the data but does release the schema. I mirrored the behavior in this impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you can of course design those APIs like this, but just because Arrow Go has a bizarre behavior doesn't make it a good idea to replicate it :-)
For example, Arrow C++ has a consistent behavior where it always releases arrays and schemas that it fails importing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, I didn't look into the C++ behavior for this. I will update to release the array and schemas on failure. Thank you!
// the data associated with this export data | ||
// does not need to be released as they are | ||
// still associated with the ArrowBuffer | ||
// and it will deallocate this memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The release
pointer is responsible for releasing/deallocating any data backing the array.
If you're doing something different, then you're not respecting the C Data Interface's contract.
Which means you probably want to keep the ArrowBuffer
s stored in ExportData
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We keep a reference to the arrowdata object which contains the buffers so the buffers are currently being stored until released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I was confused by the comment, can you perhaps make it less misleading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
} | ||
} | ||
|
||
return makeArrayHolder(arrowField, buffers: arrowBuffers, nullCount: nullCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so when would the release
pointer be called once the imported array is not referenced anymore?
It seems that currently you're leaking the imported array's memory...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is up to the developer to call to release the memory. This behavior was copied from golang impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be misreading this code, bot how would the developer ensure the data is released when the array reference disappears?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I am misunderstanding something. The following is an example of the behavior I am expecting:
- make array in go
- make the carray in go
- call ExportArrowArray to copy the go array to c array structure (this is still go)
- call a method on the swift side that accepts the c array and imports the data into swift
- the above method will perform some operation on the data or can cache the data for future reference.
- the developers swift or go code would need to ensure that the memory is released once it is no longer needed
(on the swift side the ArrowCImporter has a method for releasing the schema and another for releasing the array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so the general idea is that, once the producer (in your example, Go) has exported the data, the consumer (in your example, Swift) is responsible for calling the release
callback once it is done with the imported data.
When designing a higher-level API to abstract away details of the C Data Interface like this, the expectation is that the higher-level API level returns some kind of object (depending on the particular Arrow runtime) that ensures proper lifetime management of the data backing the imported array.
What "proper lifetime management" means depends on the language and runtime, but it should make it so that an imported array has similar semantics as a natively-allocated array. If a natively-allocated array has automatic memory management through reference counting, then so should an imported array.
For example, in Arrow C++, importing a C array backs the resulting C++ Array object with a dedicated Buffer subclass that keeps the imported data alive and calls the release
pointer when the last shared_ptr
reference to it vanishes. The result is that when an imported C++ Array is destroyed, the backing data automatically has its release
pointer called. Just like, when a natively-allocated C++ Array is destroyed, the data backing its buffers is automatically dellocated.
From the looks of it, but I may be misreading, this PR currently doesn't ensure this equivalence between imported arrays and natively-allocated arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, if I were to write tests to exercise this using Python code, I would:
- record the number of PyArrow-allocated bytes as
orig_allocated
- allocate a PyArrow array
- record the new number of PyArrow-allocated bytes as
new_allocated
- export the PyArrow array to a C array, and delete the Python array reference
- import the C array into Swift
- check that the number of PyArrow-allocated bytes is still
new_allocated
- delete the Swift array reference
- check that the number of PyArrow-allocated bytes is back to
orig_allocated
(you can imagine a similar scenario, but in reverse, to exercise importing into Python a Swift-exported array)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pitrou, gotcha, thank you for the detailed explanation! I have updated the code to return a wrapped ArrayHolder that will manage releasing the memory associated with the C array. I think this will resolve the issue.
Also, I altered the arrayToSwift go test to check if released was called to ensure memory was deallocated. This is not as comprehensive as the test description above but it is a sanity check. I will follow up with a future task to make this more robust as you have explained. Let me know if this works.
case .success: | ||
return | ||
case .failure(let err): | ||
fatalError("\(err)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you return some error code instead of crashing your tests on failure?
FTR, it's ok to test with Go Arrow, but it might be easier with PyArrow using for example PythonKit. |
Also, your unit tests should ideally ensure that memory lifetime is properly handled. There are certainly ways to do that. |
// does not need to be released as they are | ||
// still associated with the ArrowBuffer | ||
// and it will deallocate this memory | ||
ArrowCExporter.exportedData.removeValue(forKey: id.hashValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why you're still using id.hashValue
while you inserted using id
above.
Perhaps check removeValue
's value and error out if the key wasn't actually removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I just removed it. I also removed this function and just call removeValue in the release callback so they will stay in sync. Good catch!
a8166fe
to
a4a0c7c
Compare
@pitrou I believe the requested changes have been completed. Please review again when you get a chance. |
I will look to move the testing to PyArrow in a future PR. |
@pitrou Please review again when you get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
swift/Arrow/Package.swift
Outdated
name: "ArrowC", //your C/C++ library's name | ||
path: "Sources/ArrowC" //your path to the C/C++ library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: "ArrowC", //your C/C++ library's name | |
path: "Sources/ArrowC" //your path to the C/C++ library | |
name: "ArrowC", // your C/C++ library's name | |
path: "Sources/ArrowC" // your path to the C/C++ library |
fatalError("Export schema not found with id \(exportId)") | ||
} | ||
|
||
// the data associated with this exportschema object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// the data associated with this exportschema object | |
// the data associated with this exportSchema object |
} | ||
|
||
// the data associated with this exportschema object | ||
// which includes the c strings for the format and name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// which includes the c strings for the format and name | |
// which includes the C strings for the format and name |
fatalError("Export data not found with id \(exportId)") | ||
} | ||
|
||
// the data associated with this exportdata object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// the data associated with this exportdata object | |
// the data associated with this exportArray object |
} | ||
|
||
// the data associated with this exportdata object | ||
// which includes the entire arrowdata object |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// which includes the entire arrowdata object | |
// which includes the entire arrowData object |
// Not able to set the release on the schema | ||
// to NULL in swift. nil in swift is not | ||
// equivalent to NULL. | ||
void ClearReleaseSchema(struct ArrowSchema*); | ||
|
||
// Not able to set the release on the array | ||
// to NULL in swift. nil in swift is not | ||
// equivalent to NULL. | ||
void ClearReleaseArray(struct ArrowArray*); | ||
|
||
int ArrowSchemaIsReleased(struct ArrowSchema*); | ||
|
||
int ArrowArrayIsReleased(struct ArrowArray*); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using a prefix to avoid naming conflict as much as possible?
ArrowSwiftXXX()
?
(If symbols in ArrowC
are loaded with RTLD_LOCAL
, we don't need prefix.)
int ArrowSchemaIsReleased(struct ArrowSchema*); | ||
|
||
int ArrowArrayIsReleased(struct ArrowArray*); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that they aren't used. Can we remove them?
}; | ||
|
||
// Not able to set the release on the schema | ||
// to NULL in swift. nil in swift is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// to NULL in swift. nil in swift is not | |
// to NULL in Swift. nil in Swift is not |
void ClearReleaseSchema(struct ArrowSchema*); | ||
|
||
// Not able to set the release on the array | ||
// to NULL in swift. nil in swift is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// to NULL in swift. nil in swift is not | |
// to NULL in Swift. nil in Swift is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay @abandy ! Here are some more comments.
init(_ arrowType: ArrowType, name: String = "") throws { | ||
self.arrowType = arrowType | ||
self.arrowTypeName = try arrowType.cDataFormatId.cstring | ||
self.name = name.cstring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure the name
string survives long enough? I see it's not caught in this class' members.
|
||
#include "include/ArrowCData.h" | ||
|
||
void ArrowSwiftClearReleaseSchema(struct ArrowSchema* arrowSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: do you actually need these functions to be implemented in C?
ArrowC.ArrowSwiftClearReleaseSchema(data) | ||
} | ||
} catch { | ||
return .failure(.unknownError("\(error)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions:
- if we arrive here, then the
exportSchema
object is never removed fromexportedData
and therefore leaks until the end of the process, right? - can we fail with a better error message?
cSchema.format = exportSchema.arrowTypeName | ||
cSchema.name = exportSchema.name | ||
cSchema.private_data = | ||
UnsafeMutableRawPointer(mutating: UnsafeRawPointer(bitPattern: exportSchema.id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newbie question, but according to the docs, it seems UnsafeMutableRawPointer(bitPattern: exportSchema.id)
should work?
|
||
extension String { | ||
var cstring: UnsafePointer<CChar> { | ||
(self as NSString).cString(using: String.Encoding.utf8.rawValue)! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I was still a bit mystified by this, and it appears that using the cString
method actually creates a memory leak, at least on non-Apple platforms:
swiftlang/swift-corelibs-foundation#3885
Instead, perhaps we should use NSString.maximumLengthOfBytes and NSString.getCString so as to control the lifetime of the CChar array.
ArrowCImporter.release(cSchema) | ||
return .success(field) | ||
case .failure(let err): | ||
ArrowCImporter.release(cSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're releasing on failure here, we should also release in the failure cases above (children and dictionary)?
.invalid("Variable buffer count expected 3 but found \(cArray.n_buffers)")) | ||
} | ||
|
||
appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If appendToBuffer
takes a length in bytes, then the value given is not correct, is it? It should probably be:
- for buffer 0 (null bitmap):
(length + 7) / 8
- for buffer 1 (32-bit offsets):
(length + 1) * 4
- for buffer 2 (string bytes):
offsets[length] - offsets[0]
whereoffsets
is a int32 pointer to the start ofbuffers[1]
} | ||
|
||
appendToBuffer(cArray.buffers[0], arrowBuffers: &arrowBuffers, length: length) | ||
appendToBuffer(cArray.buffers[1], arrowBuffers: &arrowBuffers, length: length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, if we actually care about the buffers' bytelengths:
- for buffer 0 (null bitmap): same calculation as above
- for buffer 1 (primitive values):
(length + 7) / 8
if type is boolean, otherwiselength * sizeof(primitive type)
} | ||
} | ||
|
||
public static func release(_ cSchema: ArrowC.ArrowSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious: why not take a UnsafePointer<ArrowC.ArrowSchema>
directly here?
case ArrowTypeId.date64: | ||
return "tdm" | ||
case ArrowTypeId.time32: | ||
if let time32 = self as? ArrowTypeTime32 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this may fail?
It's not obvious to me, are the Go + C Data tests executed on this build? |
Thanks for the comments and the time. Closing PR. |
Continuation for PR: #39091 This add an initial implementation of the C Data interface for swift. During development it was found that null count was not being properly maintained on the arrow buffers and this change is included as well. Also some minor refactoring was done to existing sources to enable this feature. This has been tested from Swift calling into C to import data but not from Swift to C exporting data. Test is currently ongoing. * GitHub Issue: #37938 Authored-by: Alva Bandy <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
This add an initial implementation of the C Data interface for swift. During development it was found that null count was not being properly maintained on the arrow buffers and this change is included as well. Also some minor refactoring was done to existing sources to enable this feature.
This has been tested from Swift calling into C to import data but not from Swift to C exporting data. Test is currently ongoing.