PARQUET-1798: [C++] Review logic around automatic assignment of field_id's #10289

westonpace
Member

Questions:

  • This is my first PR in the parquet namespace, so I'm not sure of all the special rules.
  • The field ID generation doesn't happen in the parquet::schema -> arrow::schema phase but in the parquet::format::schema -> parquet::schema phase. So in order to test it I had to add #include "generated/parquet_types.h" to arrow_schema_test.cc, and I wasn't sure whether I was allowed to reference the generated/* files like that.
  • This PR simply allows user-specified field IDs to be persisted. Is that sufficient for PARQUET-1798 (the title is rather general), or should I open a dedicated JIRA?

@westonpace
Member Author

CC @emkornfield Are you able to review this? Not sure who I should ping for a parquet change.

@emkornfield
Contributor

I'll try to look tonight or tomorrow morning. Otherwise, Antoine is likely the best person.

@wesm
Member

wesm commented May 10, 2021

cc @TGooch44

@westonpace westonpace force-pushed the feature/PARQUET-1798-field-id-assignment branch from 49c3060 to bd4a8fb Compare May 10, 2021 19:57
@pitrou
Member

pitrou commented May 12, 2021

I'm not even sure what field_ids are supposed to be for. The parquet spec only has this to say:

  /** When the original schema supports field ids, this will save the
   * original field id in the parquet schema
   */
  9: optional i32 field_id;

I suppose "original schema" means something non-Parquet, but what? Is it just some kind of arbitrary application-defined id?

cpp/src/parquet/schema.cc (outdated, resolved)
@pitrou
Member

pitrou commented May 12, 2021

Based on my understanding, it seems that we should:

  • when reading from Parquet, reflect Parquet field_ids (if any) under the PARQUET:field_id metadata key
  • when writing to Parquet, generate Parquet field_ids from the PARQUET:field_id metadata key (if present)
  • not attempt to auto-generate any field_ids if they are not present in metadata
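
The three rules above can be sketched as a small toy model. This is not the Arrow C++ implementation: the function names `field_id_from_metadata` and `metadata_from_field_id` are hypothetical, and treating an unparsable or negative value as "no field id" is an assumption made for illustration only.

```python
from typing import Dict, Optional

# Metadata key named in the discussion above.
FIELD_ID_KEY = b"PARQUET:field_id"


def field_id_from_metadata(metadata: Optional[Dict[bytes, bytes]]) -> Optional[int]:
    """Writing direction: derive a Parquet field_id from Arrow-style metadata.

    Returns None (meaning: write no field_id) when the key is absent.
    Nothing is auto-generated. Rejecting unparsable or negative values
    is an assumption of this sketch.
    """
    if not metadata or FIELD_ID_KEY not in metadata:
        return None
    try:
        value = int(metadata[FIELD_ID_KEY].decode("ascii"))
    except (UnicodeDecodeError, ValueError):
        return None
    return value if value >= 0 else None


def metadata_from_field_id(field_id: Optional[int]) -> Dict[bytes, bytes]:
    """Reading direction: reflect a Parquet field_id (if any) as metadata."""
    if field_id is None:
        return {}
    return {FIELD_ID_KEY: str(field_id).encode("ascii")}
```

With these two helpers, a field that carries no `PARQUET:field_id` entry stays without an id in both directions, which is the behavior the third rule asks for.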

@emkornfield
Contributor

emkornfield commented May 12, 2021

https://issues.apache.org/jira/browse/PARQUET-951 explains field IDs a little better. They come from other systems: in this case protobuf (and I imagine Thrift has something similar), where each field in a message is annotated with a unique ID. Based on this I agree with Antoine's assessment. I haven't actually looked at the code (is this not what is done?).

@westonpace
Member Author

@pitrou wrote: "not attempt to auto-generate any field_ids if they are not present in metadata"

That should simplify things. Just to clarify, this will be a bit of a regression, as we currently auto-generate field IDs.

@emkornfield wrote: "https://issues.apache.org/jira/browse/PARQUET-951 explains field IDs a little better. They come from other systems: in this case protobuf (and I imagine Thrift has something similar), where each field in a message is annotated with a unique ID. Based on this I agree with Antoine's assessment. I haven't actually looked at the code (is this not what is done?)."

Correct. We already pulled the field id out of thrift and into Arrow metadata. The only problem was that the logic to do the reverse was missing. This PR is only adding that.

There could be some follow-up work for integrating with other parts of the Arrow ecosystem. I will send some questions to the ML.

@westonpace westonpace force-pushed the feature/PARQUET-1798-field-id-assignment branch from bd4a8fb to 6b372d0 Compare May 13, 2021 21:27
@westonpace
Member Author

Per @pitrou's suggestion I have removed the logic auto-generating field_id entirely. I also added a Python test to ensure things work end to end.

This is ready for review again.

@westonpace westonpace requested a review from pitrou May 13, 2021 21:32
@westonpace westonpace force-pushed the feature/PARQUET-1798-field-id-assignment branch 2 times, most recently from 6e80fe9 to 9349f6e Compare May 17, 2021 18:18
@westonpace
Member Author

Ok, looks like I jumped the gun with the last comment. Removing the old auto-generation behavior broke some tests I wasn't looking at. They are fixed now. I believe the CI failures are unrelated at this point. I may force-push tomorrow just for good measure.

Review is welcome.

@westonpace westonpace force-pushed the feature/PARQUET-1798-field-id-assignment branch from c11452d to 85a860b Compare May 24, 2021 16:20
@westonpace
Member Author

I've rebased (and verified again that the build failures are unrelated). Gentle ping for review @emkornfield / @pitrou

Member

@pitrou pitrou left a comment

Sorry for the delay, some comments and questions below.

# }
# optional binary field_id=5 f2;
# }

field_name = b'PARQUET:field_id'
Member

You already have it named field_id above.

Member Author

Thanks, fixed to use the existing variable.

@@ -171,17 +167,16 @@ TEST_F(TestPrimitiveNode, Attrs) {
 }

 TEST_F(TestPrimitiveNode, FromParquet) {
-  SchemaElement elt =
-      NewPrimitive(name_, FieldRepetitionType::OPTIONAL, Type::INT32, field_id_);
+  SchemaElement elt = NewPrimitive(name_, FieldRepetitionType::OPTIONAL, Type::INT32);
Member

I'm not sure I understand this change in the tests. The user should still be able to pass an explicit field_id when creating a schema node, no?

Member Author

No, the user does not need this capability. See comment below.

Member

Well, let's say I want to create a Node with a given field id, as was done in this test. How would I do that?

Member Author

Ah, I see your point. These tests were not testing FromParquet so the field id setting was still valid. I have restored them.

cpp/src/parquet/schema.h (resolved)
TEST_F(TestConvertRoundTrip, GroupIdMissingIfNotSpecified) {
  std::vector<std::shared_ptr<Field>> arrow_fields;
  arrow_fields.push_back(::arrow::field("simple", ::arrow::int32(), false));
  /// { "nested": { "outer": { "inner" }, "sibling" }
Member

Missing closing brace here?

Member Author

Fixed.

  return field_ids;
}

TEST_F(TestConvertRoundTrip, GroupIdMissingIfNotSpecified) {
Member

Why "GroupId"? Should this be "FieldId"?

Member Author

Yes, I'm not sure where Group came from. Fixed.

  return ::arrow::key_value_metadata({"PARQUET:field_id"}, {std::to_string(field_id)});
}

TEST_F(TestConvertRoundTrip, GroupIdPreserveExisting) {
Member

Well... this test only checks that Arrow metadata is preserved, right? It doesn't test that the metadata is actually converted into a field_id on the Parquet schema node.

Member Author

@westonpace westonpace May 26, 2021

Perhaps TestConvertRoundTrip::ConvertSchema should be renamed to RoundTripSchema. It does the following transformations...

  • vector<Field> -> arrow::Schema
  • arrow::Schema -> parquet::SchemaDescriptor
  • parquet::SchemaDescriptor -> vector<parquet::format::SchemaElement>
  • vector<parquet::format::SchemaElement> -> parquet::SchemaDescriptor
  • parquet::SchemaDescriptor -> arrow::Schema

So I believe it does indeed test the Parquet schema node.
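
The chain of transformations listed above can be mimicked with a toy model that keeps the field id visible at each level. The classes `ArrowField`, `ParquetNode` and `ThriftElement` are hypothetical stand-ins for arrow::Field, the parquet schema node and parquet::format::SchemaElement, not the real types. The sketch assumes only that the Thrift schema is a depth-first flattened list in which each element records its number of children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

KEY = "PARQUET:field_id"


@dataclass
class ArrowField:  # stand-in for arrow::Field
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["ArrowField"] = field(default_factory=list)


@dataclass
class ParquetNode:  # stand-in for a parquet schema node
    name: str
    field_id: Optional[int] = None
    children: List["ParquetNode"] = field(default_factory=list)


@dataclass
class ThriftElement:  # stand-in for parquet::format::SchemaElement
    name: str
    num_children: int
    field_id: Optional[int] = None


def arrow_to_parquet(f: ArrowField) -> ParquetNode:
    # The id comes only from metadata; it is never generated.
    raw = f.metadata.get(KEY)
    fid = int(raw) if raw is not None else None
    return ParquetNode(f.name, fid, [arrow_to_parquet(c) for c in f.children])


def parquet_to_thrift(node: ParquetNode) -> List[ThriftElement]:
    # Depth-first flattening: parent first, then each subtree in order.
    out = [ThriftElement(node.name, len(node.children), node.field_id)]
    for child in node.children:
        out.extend(parquet_to_thrift(child))
    return out


def thrift_to_parquet(elts: List[ThriftElement], pos: int = 0) -> Tuple[ParquetNode, int]:
    # Rebuild the tree; returns the node and the next unread position.
    elt = elts[pos]
    node = ParquetNode(elt.name, elt.field_id)
    pos += 1
    for _ in range(elt.num_children):
        child, pos = thrift_to_parquet(elts, pos)
        node.children.append(child)
    return node, pos


def parquet_to_arrow(node: ParquetNode) -> ArrowField:
    md = {} if node.field_id is None else {KEY: str(node.field_id)}
    return ArrowField(node.name, md, [parquet_to_arrow(c) for c in node.children])


def round_trip(f: ArrowField):
    parquet = arrow_to_parquet(f)
    thrift = parquet_to_thrift(parquet)
    restored, _ = thrift_to_parquet(thrift)
    return parquet, thrift, parquet_to_arrow(restored)


schema = ArrowField(
    "nested", {KEY: "1"}, [ArrowField("inner"), ArrowField("sibling", {KEY: "5"})]
)
parquet, thrift, arrow = round_trip(schema)
```

Returning all three intermediate results lets a test assert on the field ids at the Parquet and Thrift levels directly, rather than only on the Arrow schema that comes out at the end.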

Member

It doesn't seem to test that the PARQUET:field_id Arrow annotation ends up in the Parquet field_id member, does it?

Member Author

I now verify the field IDs at all three levels (round-tripped arrow, parquet, and thrift).

…metadata that I had missed. Now that the old behavior changed it was invalid. So I removed my new test and updated the existing test.
@westonpace westonpace force-pushed the feature/PARQUET-1798-field-id-assignment branch from 85a860b to 7a93325 Compare May 27, 2021 13:28
@westonpace
Member Author

As an extra level of sanity-checking I created a parquet file with Arrow and then read it in with fastparquet and verified the field_id is correct (both for a missing field_id and a valid field_id).

@westonpace
Member Author

Provided this passes CI (I'll check in the morning) I believe I have addressed all concerns.

@pitrou
Member

pitrou commented May 27, 2021

Thanks a lot for checking!

@pitrou
Member

pitrou commented May 27, 2021

@pitrou pitrou closed this in d0de88d May 27, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…_id's

Closes apache#10289 from westonpace/feature/PARQUET-1798-field-id-assignment

Lead-authored-by: Weston Pace <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@westonpace westonpace deleted the feature/PARQUET-1798-field-id-assignment branch January 6, 2022 08:17