Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dictionary support for C data interface #1407

Merged
merged 3 commits into from
Mar 9, 2022
Merged

Conversation

sunchao
Copy link
Member

@sunchao sunchao commented Mar 8, 2022

Which issue does this PR close?

Closes #1397.

Rationale for this change

Currently the Rust implementation of C data interface doesn't support dictionary type yet, which is necessary if we want to pass data between Rust and other languages such as Java/C++.

What changes are included in this PR?

  • Added dictionary support in Rust FFI.
  • Added roundtrip tests within Rust (i.e., Rust -> Rust -> Rust)
  • Added roundtrip tests between Python and Rust

I kept the DictionaryArray untouched so the dictionary (i.e., values) is still stored as the only child of the ArrayData it is associated to.

Are there any user-facing changes?

Yes, now we can use C data interface with dictionary type. There is one API change on ArrowSchema.try_new:

    pub fn try_new(
        format: &str,
        children: Vec<FFI_ArrowSchema>,
        dictionary: Option<FFI_ArrowSchema>,
    ) -> Result<Self>

This PR added one parameter dictionary.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 8, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 8, 2022

Codecov Report

Merging #1407 (7ed4d37) into master (1fbe618) will increase coverage by 0.02%.
The diff coverage is 82.60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1407      +/-   ##
==========================================
+ Coverage   83.13%   83.16%   +0.02%     
==========================================
  Files         182      182              
  Lines       53321    53394      +73     
==========================================
+ Hits        44330    44404      +74     
+ Misses       8991     8990       -1     
Impacted Files Coverage Δ
arrow/src/datatypes/ffi.rs 73.65% <68.75%> (+2.80%) ⬆️
arrow/src/ffi.rs 85.78% <90.19%> (+1.16%) ⬆️
arrow/src/array/ffi.rs 100.00% <100.00%> (ø)
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/datatypes/field.rs 53.79% <0.00%> (-0.31%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️
arrow/src/array/array_binary.rs 93.28% <0.00%> (+0.05%) ⬆️
arrow/src/array/transform/mod.rs 86.43% <0.00%> (+0.11%) ⬆️
arrow/src/array/equal/mod.rs 93.24% <0.00%> (+0.33%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fbe618...7ed4d37. Read the comment docs.

@sunchao
Copy link
Member Author

sunchao commented Mar 8, 2022

cc @alamb @jorgecarleitao @kszucs could you help to review this? thanks!

@sunchao
Copy link
Member Author

sunchao commented Mar 8, 2022

We should also be able to pass is_ordered flag via the C data interface, but it appears that property is not supported in the Rust implementation yet (perhaps it should be part of DataType::Dictionary).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @sunchao

I wonder if @viirya may have some time to review this PR as well as he has been working in the FFI interface recently as well

@@ -122,14 +123,6 @@ def test_type_roundtrip_raises(pyarrow_type):
with pytest.raises(pa.ArrowException):
rust.round_trip_type(pyarrow_type)


def test_dictionary_type_roundtrip():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

// verify
let new_values = vec!["a", "aaa", "aaa", "a", "aaa", "aaa"];
let expected: DictionaryArray<Int8Type> = new_values.into_iter().collect();
assert_eq!(actual, &expected);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🆗

@viirya
Copy link
Member

viirya commented Mar 8, 2022

Yea, I will review this PR. Thanks @alamb @sunchao

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @sunchao . Overall looks great! Left 3 comments below.

arrow/src/ffi.rs Outdated
let data_type = &self.data_type()?;
// Special handling for dictionary type as we only care about the key type in the case.
let data_type = match &self.data_type()? {
DataType::Dictionary(key_data_type, _) => key_data_type.as_ref().clone(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would be possible to not clone here by returning a reference?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we can do this. I need to move &self.data_type()? into a separate assignment though since otherwise I'm getting "temporary value dropped while borrowed" error.

@@ -127,4 +128,14 @@ mod tests {
let data = array.data();
test_round_trip(data)
}

#[test]
fn test_dictionary() -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be worth to have an example with validity (in both the keys and values), so that we cover the most complex use-case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Will add.

"""
Python -> Rust -> Python
"""
a = pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - with validities?

.map(|i| {
let child = self.child(i);
child.to_data()
})
.map(|d| d.unwrap())
.collect();

if let Some(d) = self.dictionary() {
// For dictionary type there should only be a single child, so we don't need to worry if
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add assert for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@viirya
Copy link
Member

viirya commented Mar 8, 2022

Yes, now we can use C data interface with dictionary type. No API change.

FFI_ArrowSchema.try_new is a public function, so I wonder this should be an API change?

@sunchao
Copy link
Member Author

sunchao commented Mar 8, 2022

FFI_ArrowSchema.try_new is a public function, so I wonder this should be an API change?

Hmm you are right. Looks like try_new doesn't have to be public though. I think people should just use ArrowArray directly.

@sunchao
Copy link
Member Author

sunchao commented Mar 9, 2022

Thanks @jorgecarleitao @alamb @viirya ! I've addressed your comments. Let me know if you have more feedback. Otherwise I'll merge this tonight.

@sunchao sunchao merged commit f19d1ed into apache:master Mar 9, 2022
@sunchao
Copy link
Member Author

sunchao commented Mar 9, 2022

Merged, thanks!

@sunchao sunchao deleted the ffi-dict branch March 9, 2022 07:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support dictionary array in C data interface
5 participants