-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1798: [C++] Review logic around automatic assignment of field_id's #10289
PARQUET-1798: [C++] Review logic around automatic assignment of field_id's #10289
Conversation
CC @emkornfield Are you able to review this? Not sure who I should ping for a parquet change. |
I'll try to look tonight or tomorrow morning. Otherwise, Antoine is likely the best person. |
cc @TGooch44 |
49c3060
to
bd4a8fb
Compare
I'm not even sure what field_ids are supposed to be for. The parquet spec only has this to say: /** When the original schema supports field ids, this will save the
* original field id in the parquet schema
*/
9: optional i32 field_id; I suppose "original schema" means something non-Parquet, but what? Is it just some kind of arbitrary application-defined id? |
Based on my understanding, it seems that we should:
|
https://issues.apache.org/jira/browse/PARQUET-951 informs field IDs a little bit better. It is from other systems, in this case protobuf (and I imagine thrift might also have something similar) has each field in a message annotated with a unique ID. Based on this I agree with Antoine's assessment, haven't actually looked at the code (is this not what is done?). |
That should simplify things. Just to clarify, this will be a bit of a regression as we currently auto-generate field IDs today.
Correct. We already pulled the field id out of thrift and into Arrow metadata. The only problem was that the logic to do the reverse was missing. This PR is only adding that. There could be some follow-up work for integrating with other parts of the Arrow ecosystem. I will send some questions to the ML. |
bd4a8fb
to
6b372d0
Compare
Per @pitrou 's suggestion I have removed the logic auto-generating field_id entirely. I also added a python test to ensure things are working full path. This is ready for review again. |
6e80fe9
to
9349f6e
Compare
Ok, looks like I jumped the gun with the last comment. Removing the old auto-generation behavior broke some tests I wasn't looking at. They are fixed now. I believe the CI failures are unrelated at this point. I may force-push tomorrow just for good measure. Review is welcome. |
c11452d
to
85a860b
Compare
I've rebased (and verified again that the build failures are unrelated). Gentle ping for review @emkornfield / @pitrou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, some comments and questions below.
# } | ||
# optional binary field_id=5 f2; | ||
# } | ||
|
||
field_name = b'PARQUET:field_id' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already have it named field_id
above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed to use the existing variable.
cpp/src/parquet/schema_test.cc
Outdated
@@ -171,17 +167,16 @@ TEST_F(TestPrimitiveNode, Attrs) { | |||
} | |||
|
|||
TEST_F(TestPrimitiveNode, FromParquet) { | |||
SchemaElement elt = | |||
NewPrimitive(name_, FieldRepetitionType::OPTIONAL, Type::INT32, field_id_); | |||
SchemaElement elt = NewPrimitive(name_, FieldRepetitionType::OPTIONAL, Type::INT32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this change in the tests. The user should still be able to pass an explicit field_id
when creating a schema node, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the user does not need this capability. See comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, let's say I want to create a Node
with a given field id as was done in this test. Would you I do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see your point. These tests were not testing FromParquet
so the field id setting was still valid. I have restored them.
TEST_F(TestConvertRoundTrip, GroupIdMissingIfNotSpecified) { | ||
std::vector<std::shared_ptr<Field>> arrow_fields; | ||
arrow_fields.push_back(::arrow::field("simple", ::arrow::int32(), false)); | ||
/// { "nested": { "outer": { "inner" }, "sibling" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing closing brace here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
return field_ids; | ||
} | ||
|
||
TEST_F(TestConvertRoundTrip, GroupIdMissingIfNotSpecified) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "GroupId"? Should this be "FieldId"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'm not sure where Group came from. Fixed.
return ::arrow::key_value_metadata({"PARQUET:field_id"}, {std::to_string(field_id)}); | ||
} | ||
|
||
TEST_F(TestConvertRoundTrip, GroupIdPreserveExisting) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... this test only checks that Arrow metadata is preserved, right? It doesn't test that the metadata is actually converted into a field_id on the Parquet schema node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps TestConvertRoundTrip::ConvertSchema
should be renamed to RoundTripSchema
. It does the following transformations...
vector<Field>
->arrow::Schema
arrow::Schema
->parquet::SchemaDescriptor
parquet::SchemaDescriptor
->vector<parquet::format::SchemaElement>
vector<parquet::format::SchemaElement>
->parquet::SchemaDescriptor
parquet::SchemaDescriptor
->arrow::Schema
So I believe it does indeed test the Parquet schema node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to test that the PARQUET:field_id
Arrow annotation ends up in the Parquet field_id
member, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now verify the field IDs at all three levels (round-tripped arrow, parquet, and thrift).
… to parquet file. On read these existing field ids will be preferred to the auto-generating (dfs-assignment) algorithm.
… in my test but thrift_internal is what I was looking for
…metadata that I had missed. Now that the old behavior changed it was invalid. So I removed my new test and updated the existing test.
85a860b
to
7a93325
Compare
As an extra level of sanity-checking I created a parquet file with Arrow and then read it in with fastparquet and verified the field_id is correct (both for a missing field_id and a valid field_id). |
Provided this passes CI (I'll check in the morning) I believe I have addressed all concerns. |
Thanks a lot for checking! |
AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/builds/39351310 |
…_id's Questions: - This is my first PR in the parquet namespace, I'm not sure of all the special rules. - The field ID generation doesn't happen on the `parquet::schema` -> `arrow::schema` phase but on the `parquet::format::schema` -> `parquet::schema` phase. So in order to test I had to add `#include "generated/parquet_types.h"` to `arrow_schema_test.cc` and I wasn't sure if I was allowed to reference the `generated/*` files like that. - This PR simply allows user specified field id's to be persisted. Is that sufficient for PARQUET-1798 (the title is rather general) or should I open up a dedicated JIRA? Closes apache#10289 from westonpace/feature/PARQUET-1798-field-id-assignment Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Questions:
parquet::schema
->arrow::schema
phase but on theparquet::format::schema
->parquet::schema
phase. So in order to test I had to add#include "generated/parquet_types.h"
toarrow_schema_test.cc
and I wasn't sure if I was allowed to reference thegenerated/*
files like that.