Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6678: [C++][Parquet] Binary data stored in Parquet metadata must be base64-encoded to be UTF-8 compliant #5493

Closed
wants to merge 4 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Sep 24, 2019

I have added a simple base64 implementation (Zlib license) to arrow/vendored from

https://github.com/ReneNyffenegger/cpp-base64

@kou
Copy link
Member

kou commented Sep 24, 2019

@wesm
Copy link
Member Author

wesm commented Sep 25, 2019

I think the strategy here is a stopgap so that we can release. I think we are trying to depend less in general on Boost so either way vendoring a base64 implementation may be a good idea

@kou
Copy link
Member

kou commented Sep 25, 2019

I see.

@wesm
Copy link
Member Author

wesm commented Sep 25, 2019

This needs exports for MSVC. Adding them

@wesm
Copy link
Member Author

wesm commented Sep 25, 2019

Here's running builds

Appveyor: https://ci.appveyor.com/project/wesm/arrow/builds/27646405
Travis CI: https://travis-ci.org/wesm/arrow/builds/589252353

Can be merged after a bit more of the builds run if deemed acceptable

in a product, an acknowledgment in the product documentation would be
appreciated but is not required.

2. Altered source versions must be plainly marked as such, and must not be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the source code hasn't been modified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unmodified

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be pedantic, the only difference is adding the arrow::util namespace.

@emkornfield
Copy link
Contributor

All C++ related builds have passed. I think this can merged.

As follow-ups it would be nice to have:

  1. Tests for forward compatibility if the encoding changes in some way
  2. Possibly adding integration smoke tests (including the one used that found this issue to some place in our CI). Maybe just running the parquet-mr tool?


while (in_len-- && ( encoded_string[in_] != '=') && is_base64(encoded_string[in_])) {
char_array_4[i++] = encoded_string[in_]; in_++;
if (i ==4) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit space is off here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(we don't clang-format the files in vendored/*)

@wesm
Copy link
Member Author

wesm commented Sep 25, 2019

Whether UTF-8 is validated may depend on the Thrift library. It was only caught incidentally in a Python-based Thrift decoder because Python did decode('utf-8') on the bytes and failed

@emkornfield
Copy link
Contributor

+1, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants