ARROW-6678: [C++][Parquet] Binary data stored in Parquet metadata must be base64-encoded to be UTF-8 compliant #5493
It seems that Boost provides base64-related iterators:
Can we use them? Or should we bundle a new base64 library? FYI: It seems that Boost has bundled the same base64 library since 1.66: https://github.com/boostorg/beast/blob/develop/include/boost/beast/core/detail/base64.ipp
I think the strategy here is a stopgap so that we can release. In general we are trying to depend less on Boost, so either way vendoring a base64 implementation may be a good idea.
I see.
This needs exports for MSVC. Adding them.
Here are the running builds on Appveyor: https://ci.appveyor.com/project/wesm/arrow/builds/27646405 This can be merged after a few more of the builds have run, if deemed acceptable.
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.

2. Altered source versions must be plainly marked as such, and must not be
I assume the source code hasn't been modified?
Unmodified
To be pedantic, the only difference is adding the arrow::util namespace.
All C++-related builds have passed. I think this can be merged. As follow-ups it would be nice to have:
while (in_len-- && ( encoded_string[in_] != '=') && is_base64(encoded_string[in_])) {
char_array_4[i++] = encoded_string[in_]; in_++;
if (i ==4) {
Nit: the spacing is off here.
This is the case in the source file https://github.com/ReneNyffenegger/cpp-base64/blob/master/base64.cpp#L97
(we don't clang-format the files in vendored/*)
Whether UTF-8 is validated may depend on the Thrift library. It was only caught incidentally in a Python-based Thrift decoder because Python did
+1, thanks.
I have added a simple base64 implementation (Zlib license) to arrow/vendored from
https://github.com/ReneNyffenegger/cpp-base64
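
For readers unfamiliar with why base64 addresses the UTF-8 problem, here is a minimal, self-contained sketch of the idea. It is not the vendored cpp-base64 code and not Arrow's API: the names `Base64EncodeSketch`, `Base64DecodeSketch`, and `IsAscii` are hypothetical and exist only for illustration. The point it shows is that arbitrary bytes (including ones that are not valid UTF-8, such as 0xFF) become plain ASCII after encoding, so a string field that expects UTF-8 can carry them, and decoding restores the original bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Standard base64 alphabet (RFC 4648). All names below are hypothetical
// illustration helpers, not the vendored cpp-base64 functions.
static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode arbitrary bytes into base64 text with '=' padding.
std::string Base64EncodeSketch(const std::string& bytes) {
  std::string out;
  out.reserve(((bytes.size() + 2) / 3) * 4);
  std::size_t i = 0;
  // Full groups: 3 input bytes -> 4 output characters.
  for (; i + 3 <= bytes.size(); i += 3) {
    const std::uint32_t n =
        (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16) |
        (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8) |
        static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 2]));
    out.push_back(kAlphabet[(n >> 18) & 63]);
    out.push_back(kAlphabet[(n >> 12) & 63]);
    out.push_back(kAlphabet[(n >> 6) & 63]);
    out.push_back(kAlphabet[n & 63]);
  }
  // Trailing 1 or 2 bytes: emit partial group and pad with '='.
  const std::size_t rem = bytes.size() - i;
  if (rem > 0) {
    std::uint32_t n =
        static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i])) << 16;
    if (rem == 2) {
      n |= static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8;
    }
    out.push_back(kAlphabet[(n >> 18) & 63]);
    out.push_back(kAlphabet[(n >> 12) & 63]);
    out.push_back(rem == 2 ? kAlphabet[(n >> 6) & 63] : '=');
    out.push_back('=');
  }
  return out;
}

// Decode base64 text; stops at the first '=' or non-alphabet character.
std::string Base64DecodeSketch(const std::string& text) {
  std::string out;
  std::uint32_t buffer = 0;
  int bits = 0;
  for (char c : text) {
    if (c == '=' || c == '\0') break;
    const char* p = std::strchr(kAlphabet, c);
    if (p == nullptr) break;
    buffer = (buffer << 6) | static_cast<std::uint32_t>(p - kAlphabet);
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      out.push_back(static_cast<char>((buffer >> bits) & 0xFF));
    }
  }
  return out;
}

// True if every byte is 7-bit ASCII, which is always valid UTF-8.
bool IsAscii(const std::string& s) {
  for (unsigned char c : s) {
    if (c >= 0x80) return false;
  }
  return true;
}
```

Under these assumptions, `Base64EncodeSketch(std::string("\xff\x00\x80", 3))` yields ASCII-only text even though the input contains 0xFF, and passing that text to `Base64DecodeSketch` returns the original three bytes. The cost is size: every 3 input bytes become 4 output characters.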