Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimal MapArray support #491

Merged
merged 16 commits into from
Jul 31, 2021
Merged

Minimal MapArray support #491

merged 16 commits into from
Jul 31, 2021

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jun 22, 2021

Which issue does this PR close?

Related #395 .

Rationale for this change

To add support for the map array :)

What changes are included in this PR?

This adds minimal support for arrow::array::MapArray and Parquet map support.

Includes

  • arrow::datatype::Map(Box<Field>, bool /* keys_sorted */ )
  • arrow::array::MapArray
  • a really minimal arrow::array::MapArrayBuilder. Note, this API was just to make progress, so more work is needed
  • arrow::ipc::{reader, writer} support
  • arrow::json::reader support
  • Equality support
  • Arrow schema <> Parquet schema
  • Arrow <> Parquet reader
  • Arrow <> Parquet writer
    • Primitive maps

Future work

See #538

Are there any user-facing changes?

Yes, new data type, enum, etc. Builds shall break

@nevi-me nevi-me added the api-change Changes to the arrow API label Jun 22, 2021
@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Jun 22, 2021
@nevi-me nevi-me force-pushed the map-support branch 2 times, most recently from 82fec61 to 2d7ddd5 Compare June 23, 2021 18:37
@nevi-me nevi-me force-pushed the map-support branch 2 times, most recently from 8be09ae to 83d8406 Compare July 3, 2021 08:18
@codecov-commenter
Copy link

codecov-commenter commented Jul 3, 2021

Codecov Report

Merging #491 (ef6944e) into master (2b2ec7e) will increase coverage by 0.03%.
The diff coverage is 80.54%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #491      +/-   ##
==========================================
+ Coverage   82.48%   82.52%   +0.03%     
==========================================
  Files         167      168       +1     
  Lines       46450    47227     +777     
==========================================
+ Hits        38315    38973     +658     
- Misses       8135     8254     +119     
Impacted Files Coverage Δ
arrow/src/array/data.rs 73.76% <0.00%> (+0.15%) ⬆️
arrow/src/ipc/writer.rs 86.22% <ø> (ø)
arrow/src/array/equal/mod.rs 93.45% <25.00%> (-0.48%) ⬇️
arrow/src/array/equal_json.rs 88.69% <33.33%> (-2.52%) ⬇️
arrow/src/datatypes/field.rs 51.36% <45.00%> (-0.47%) ⬇️
arrow/src/datatypes/datatype.rs 66.08% <50.00%> (-0.74%) ⬇️
parquet/src/arrow/array_reader.rs 77.70% <77.67%> (+0.25%) ⬆️
arrow/src/array/builder.rs 85.97% <81.44%> (-0.29%) ⬇️
arrow/src/util/integration_util.rs 70.10% <81.81%> (+0.54%) ⬆️
parquet/src/arrow/schema.rs 87.35% <81.88%> (-1.18%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2b2ec7e...ef6944e. Read the comment docs.

@nevi-me nevi-me changed the title [DRAFT] Minimal MapArray support Minimal MapArray support Jul 11, 2021
@nevi-me nevi-me mentioned this pull request Jul 11, 2021
5 tasks
@nevi-me nevi-me marked this pull request as ready for review July 11, 2021 12:29
@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 11, 2021

@alamb @jorgecarleitao this is now ready for review

@alamb
Copy link
Contributor

alamb commented Jul 12, 2021

@nevi-me I plan to review this over the next few days

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read this whole PR -- and I think it is well written, well commented, and well tested and think it is good to merge in. I had some minor feedback but nothing I think would prevent merging.

Epic work @nevi-me 🏅

Regarding the "experimental API" -- I wonder if we could avoid it. For example, if we just merged this PR after arrow 5.0.0 is released (I hope to cut the branch / make an RC tomorrow) we can freely change the APIs on the master branch until arrow 6.0.0 is cut and not worry about backwards compatibility

arrow/src/array/array.rs Outdated Show resolved Hide resolved
arrow/src/array/array_map.rs Outdated Show resolved Hide resolved
.add_buffer(entry_offsets)
.add_child_data(entry_struct.data().clone())
.build();
let map_array = MapArray::from(map_data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to here, this method seems almost identical to create_from_buffers -- perhaps it could be reused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made some slight differences between the two functions to test null values on the one map

arrow/Cargo.toml Outdated Show resolved Hide resolved
arrow/src/array/array_map.rs Outdated Show resolved Hide resolved
mutable.child_data.iter_mut().for_each(|child| {
child.extend(
index,
array.offset() + start,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use the array.offset()? I don't understand this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes from the struct slice fix, I had rebased it into the map support so it wouldn't block my progress. I've rebased now that the PR is merged.

arrow/src/json/reader.rs Show resolved Hide resolved
parquet/src/arrow/arrow_reader.rs Show resolved Hide resolved
.expect("Failed to read into array!");

for batch in record_batch_reader {
batch.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it would be valuable to check the contents of the record batches that were read out (in addition to checking they are not errors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky @alamb, because there's no output data to compare what's read to. The parquet test data that's used by parquet-mr, parquet-cpp and parquet-rs don't have equivalent JSON files like other integration tests.

So, the next best thing has been to read the files, and confirm that the reader doesn't error out.

map_field.data_type()
);
}
}
DataType::FixedSizeList(_, _) => unimplemented!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be useful to test in this module as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 14, 2021

@alamb I'm working on addressing your feedback, will update this PR in a few hours

nevi-me added 6 commits July 29, 2021 22:36
This commit adds the MapArray and MapBuilder.
The interfaces are however incomplete at this stage.
A map is a list with some specific rules, so for equality it is the same as a list
@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 29, 2021

@alamb I've addressed your review, PTAL

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the last two commits and they look great @nevi-me !

@nevi-me nevi-me merged commit 9be938e into apache:master Jul 31, 2021
@houqp
Copy link
Member

houqp commented Jul 31, 2021

thank you @nevi-me !

@nevi-me nevi-me deleted the map-support branch August 1, 2021 07:20
@nevi-me nevi-me mentioned this pull request Aug 3, 2021
4 tasks
@nevi-me
Copy link
Contributor Author

nevi-me commented Sep 1, 2021

I read this whole PR -- and I think it is well written, well commented, and well tested and think it is good to merge in. I had some minor feedback but nothing I think would prevent merging.

Epic work @nevi-me 🏅

Regarding the "experimental API" -- I wonder if we could avoid it. For example, if we just merged this PR after arrow 5.0.0 is released (I hope to cut the branch / make an RC tomorrow) we can freely change the APIs on the master branch until arrow 6.0.0 is cut and not worry about backwards compatibility

@alamb, do you think it would be possible to include this in the next 5.x release? I forgot that we never added this PR to active releases, so it hasn't been released yet.

@houqp
Copy link
Member

houqp commented Sep 7, 2021

If this introduces breaking API changes, I think the current policy is we will have to wait for 6.x?

@alamb
Copy link
Contributor

alamb commented Sep 9, 2021

As @houqp mentions, I think this one can not be included into 5.x without breaking compatibility.

Specifically, the introduction of DataType::Map will cause existing code that relies on that enum to fail to compile. Here is an example of that happening in a WIP PR in datafusion - apache/datafusion#984 (apache/datafusion#984 (comment))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants