ARROW-6678: [C++][Parquet] Binary data stored in Parquet metadata must be base64-encoded to be UTF-8 compliant #5493

wesm · 2019-09-24T23:35:31Z

I have added a simple base64 implementation (Zlib license) to arrow/vendored from

https://github.com/ReneNyffenegger/cpp-base64


kou · 2019-09-24T23:59:32Z

It seems that Boost provides base64-related iterators.

Can we use them, or should we bundle a new base64 library?

FYI: It seems that Boost has bundled the same base64 library since 1.66: https://github.com/boostorg/beast/blob/develop/include/boost/beast/core/detail/base64.ipp

wesm · 2019-09-25T01:50:26Z

I think the strategy here is a stopgap so that we can release. In general we are trying to depend less on Boost, so either way vendoring a base64 implementation may be a good idea.

kou · 2019-09-25T01:55:03Z

I see.

wesm · 2019-09-25T02:44:19Z

This needs exports for MSVC. Adding them.

wesm · 2019-09-25T03:10:37Z

Here are the running builds:

Appveyor: https://ci.appveyor.com/project/wesm/arrow/builds/27646405
Travis CI: https://travis-ci.org/wesm/arrow/builds/589252353

This can be merged once a few more of the builds have run, if deemed acceptable.

emkornfield · 2019-09-25T04:25:34Z

cpp/src/arrow/vendored/base64.cpp

+      in a product, an acknowledgment in the product documentation would be
+      appreciated but is not required.
+
+   2. Altered source versions must be plainly marked as such, and must not be


I assume the source code hasn't been modified?

To be pedantic, the only difference is adding the arrow::util namespace.

emkornfield · 2019-09-25T04:29:24Z

All C++-related builds have passed. I think this can be merged.

As follow-ups it would be nice to have:

Tests for forward compatibility, in case the encoding changes in some way.
Possibly integration smoke tests somewhere in our CI (including the one that found this issue). Maybe just running the parquet-mr tool?
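One way the forward-compatibility follow-up could be approached is to pin the encoder to fixed input/output pairs, so that any later change in the encoding shows up as a test failure, and to check that previously written strings still decode. The following is a minimal sketch under stated assumptions: the `sketch` namespace and the functions `Base64Encode`, `Base64Decode`, and `GoldenVectorsHold` are hypothetical names written for illustration and are not the vendored cpp-base64 code or any Arrow API. The golden pairs are the standard test vectors from RFC 4648.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {  // hypothetical namespace, not part of Arrow

const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Maps a base64 character to its 6-bit value, or -1 if it is not in the
// standard alphabet. Assumes an ASCII-compatible execution character set.
int IndexOf(char c) {
  if (c >= 'A' && c <= 'Z') return c - 'A';
  if (c >= 'a' && c <= 'z') return c - 'a' + 26;
  if (c >= '0' && c <= '9') return c - '0' + 52;
  if (c == '+') return 62;
  if (c == '/') return 63;
  return -1;
}

// Encodes arbitrary bytes as standard, padded base64.
std::string Base64Encode(const std::string& in) {
  auto byte = [&in](std::size_t k) {
    return static_cast<std::uint32_t>(static_cast<unsigned char>(in[k]));
  };
  std::string out;
  out.reserve(((in.size() + 2) / 3) * 4);
  std::size_t i = 0;
  for (; i + 3 <= in.size(); i += 3) {  // full 3-byte groups
    const std::uint32_t v = (byte(i) << 16) | (byte(i + 1) << 8) | byte(i + 2);
    out += kAlphabet[(v >> 18) & 0x3F];
    out += kAlphabet[(v >> 12) & 0x3F];
    out += kAlphabet[(v >> 6) & 0x3F];
    out += kAlphabet[v & 0x3F];
  }
  const std::size_t rem = in.size() - i;  // 0, 1, or 2 trailing bytes
  if (rem > 0) {
    std::uint32_t v = byte(i) << 16;
    if (rem == 2) v |= byte(i + 1) << 8;
    out += kAlphabet[(v >> 18) & 0x3F];
    out += kAlphabet[(v >> 12) & 0x3F];
    out += (rem == 2) ? kAlphabet[(v >> 6) & 0x3F] : '=';
    out += '=';
  }
  return out;
}

// Decodes padded base64. Returns false on bad length, characters outside the
// alphabet, or misplaced padding.
bool Base64Decode(const std::string& in, std::string* out) {
  out->clear();
  if (in.size() % 4 != 0) return false;
  for (std::size_t i = 0; i < in.size(); i += 4) {
    std::uint32_t v = 0;
    int pad = 0;
    for (std::size_t k = 0; k < 4; ++k) {
      const char c = in[i + k];
      int d = 0;
      if (c == '=') {
        // Padding may only fill the last one or two slots of the final group.
        if (i + 4 != in.size() || k < 2) return false;
        ++pad;
      } else {
        if (pad > 0) return false;  // data after padding
        d = IndexOf(c);
        if (d < 0) return false;
      }
      v = (v << 6) | static_cast<std::uint32_t>(d);
    }
    out->push_back(static_cast<char>((v >> 16) & 0xFF));
    if (pad < 2) out->push_back(static_cast<char>((v >> 8) & 0xFF));
    if (pad < 1) out->push_back(static_cast<char>(v & 0xFF));
  }
  return true;
}

struct GoldenVector {
  const char* raw;
  const char* encoded;
};

// RFC 4648 test vectors, used here as pinned expectations.
const GoldenVector kGolden[] = {
    {"", ""},           {"f", "Zg=="},         {"fo", "Zm8="},
    {"foo", "Zm9v"},    {"foob", "Zm9vYg=="},  {"fooba", "Zm9vYmE="},
    {"foobar", "Zm9vYmFy"}};

// True if every pinned pair still encodes and decodes to the same values.
bool GoldenVectorsHold() {
  for (const auto& g : kGolden) {
    std::string decoded;
    if (Base64Encode(g.raw) != g.encoded) return false;
    if (!Base64Decode(g.encoded, &decoded) || decoded != g.raw) return false;
  }
  return true;
}

}  // namespace sketch
```

A real test would presumably pin strings produced by an actual release rather than generic vectors, so that files written by older versions are known to remain readable.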

emkornfield · 2019-09-25T04:30:30Z

cpp/src/arrow/vendored/base64.cpp

+
+  while (in_len-- && ( encoded_string[in_] != '=') && is_base64(encoded_string[in_])) {
+    char_array_4[i++] = encoded_string[in_]; in_++;
+    if (i ==4) {


Nit: the spacing is off here.

This is the case in the source file https://github.com/ReneNyffenegger/cpp-base64/blob/master/base64.cpp#L97

(we don't clang-format the files in vendored/*)

wesm · 2019-09-25T04:30:33Z

Whether UTF-8 is validated may depend on the Thrift library. It was only caught incidentally in a Python-based Thrift decoder, because Python called decode('utf-8') on the bytes and failed.
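To illustrate the failure mode described above, here is a minimal sketch of a strict UTF-8 well-formedness check. It is an assumption-laden illustration: `IsValidUtf8` is a hypothetical helper, not a function from Arrow, Thrift, or the vendored library, and the byte values in the usage note are made up rather than taken from a real Parquet file.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects invalid lead bytes,
// truncated sequences, overlong encodings, surrogates, and code points
// above U+10FFFF. Hypothetical helper for illustration only.
bool IsValidUtf8(const std::string& s) {
  const auto* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  // Smallest code point that legitimately needs a sequence of each length.
  static const std::uint32_t kMin[5] = {0, 0, 0x80, 0x800, 0x10000};
  std::size_t i = 0;
  while (i < n) {
    const unsigned char c = p[i];
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if (c < 0x80) {
      len = 1;
      cp = c;
    } else if ((c & 0xE0) == 0xC0) {
      len = 2;
      cp = c & 0x1F;
    } else if ((c & 0xF0) == 0xE0) {
      len = 3;
      cp = c & 0x0F;
    } else if ((c & 0xF8) == 0xF0) {
      len = 4;
      cp = c & 0x07;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (i + len > n) return false;  // sequence runs past the end
    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (len > 1 && cp < kMin[len]) return false;     // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += len;
  }
  return true;
}
```

With this check, a raw binary value such as the four bytes `FF FF FF FF` is rejected, while its base64 form `/////w==` passes, since base64 output uses only ASCII characters. A decoder that skips this kind of validation would accept both, which is consistent with the issue having surfaced only in a decoder that validated.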

emkornfield · 2019-09-25T04:35:26Z

+1, thanks.

wesm added 3 commits September 24, 2019 18:33

Add vendored base64 C++ implementation and ensure that Thrift KeyValue in Parquet metadata is UTF-8

b3a584a

Fix LICENSE.txt, add iwyu export

eabb121

Fix Python unit test that needs to base64-decode now

06f75cd

Simplify, add MSVC exports

c058e86

emkornfield reviewed Sep 25, 2019

emkornfield closed this in 4fe330a Sep 25, 2019

asfimport mentioned this pull request Sep 25, 2019

[C++] Regression in Parquet file compatibility introduced by ARROW-3246 #23026

Closed