ARROW-67: C++ metadata flatbuffer serialization and data movement to memory maps #28
Conversation
FWIW, I have only good things to say about Flatbuffers (https://github.com/google/flatbuffers) so far. It's only metadata (serializing the data itself with flatbuffers would not be appropriate for several reasons), but it's fast and, most importantly, lightweight compared with the alternatives (e.g. Thrift).
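For readers who have not used FlatBuffers before, here is a minimal sketch of the build/read round trip with the FlatBuffers C++ API. The schema, namespace, generated header, and field names below are hypothetical stand-ins (not this patch's Message.fbs); the point is that the generated accessors read directly out of the serialized bytes, so metadata needs no unpacking step, while the column data itself stays in separate buffers that FlatBuffers never touches.

```cpp
// Hypothetical schema (metadata.fbs), a stand-in for illustration only:
//
//   namespace example;
//   table RowBatchMeta { num_rows: long; body_length: long; }
//   root_type RowBatchMeta;
//
// `flatc --cpp metadata.fbs` produces metadata_generated.h with the usual
// Create*/Get* helpers used below.
#include <iostream>

#include "flatbuffers/flatbuffers.h"
#include "metadata_generated.h"  // hypothetical generated header

int main() {
  // Build: only the metadata goes through FlatBuffers; the values themselves
  // would live in separate memory regions referenced by this metadata.
  flatbuffers::FlatBufferBuilder fbb;
  auto meta = example::CreateRowBatchMeta(fbb, /*num_rows=*/1024,
                                          /*body_length=*/4096);
  fbb.Finish(meta);

  // Read: no deserialization pass -- accessors index into the buffer in place.
  const auto* batch = example::GetRowBatchMeta(fbb.GetBufferPointer());
  std::cout << batch->num_rows() << " rows, " << batch->body_length()
            << " body bytes" << std::endl;
  return 0;
}
```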
if (FLATBUFFERS_INCLUDE_DIR AND FLATBUFFERS_LIBRARIES)
  set(FLATBUFFERS_FOUND TRUE)
  get_filename_component( FLATBUFFERS_LIBS ${FLATBUFFERS_LIBRARIES} DIRECTORY )
Use PATH instead of DIRECTORY here to support CMake 2.8
thank you, fixed
I'm calling a checkpoint here (before the patch gets much larger), as I've got simple integer row batches round-tripping successfully to a memory map. Sorry this got so big; a lot of scaffolding was needed to make this all work. @StevenMPhillips I wouldn't mind a review of the code in […]. I'll make a pass at incorporating feedback from #16. There's a long list of follow-ups to incorporate all aspects of the Arrow spec into the IPC code, so I'll create lots of follow-up JIRAs to break things into more bite-sized pieces.
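To make the memory-map round trip concrete for reviewers, here is a minimal POSIX sketch of the underlying idea (this is not the patch's actual memory-map or row-batch classes, and the file path is made up): write a few int32 values into a memory-mapped file, then map the file again and use the values in place, with no copy or parse step on the read side.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
  const int32_t values[] = {1, 2, 3, 4};
  const size_t nbytes = sizeof(values);

  // Create a file and size it to hold the data.
  int fd = open("/tmp/rowbatch.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
  if (fd < 0 || ftruncate(fd, nbytes) != 0) return 1;

  // "Write" side: copy the values into the mapping and flush to the file.
  void* wmap = mmap(nullptr, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (wmap == MAP_FAILED) return 1;
  std::memcpy(wmap, values, nbytes);
  msync(wmap, nbytes, MS_SYNC);
  munmap(wmap, nbytes);

  // "Read" side: map again and interpret the bytes in place -- zero-copy.
  void* rmap = mmap(nullptr, nbytes, PROT_READ, MAP_SHARED, fd, 0);
  if (rmap == MAP_FAILED) return 1;
  const int32_t* ints = static_cast<const int32_t*>(rmap);
  for (size_t i = 0; i < nbytes / sizeof(int32_t); ++i) {
    std::cout << ints[i] << " ";
  }
  std::cout << std::endl;
  munmap(rmap, nbytes);
  close(fd);
  return 0;
}
```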
Incorporated code review comments from #16.
@@ -7,5 +7,6 @@ SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd)
source thirdparty/versions.sh

export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR
export FLATBUFFERS_HOME=$SOURCE_DIR/thirdparty/installed
it looks like this might conflict with #29 as well
No problem, very much appreciate the reviews. I'll address your comments and then rebase/merge so we can move forward. The Cython stuff is all refactoring so no need to scrutinize there.
I'm mostly going to be working on Python- and Parquet-related stuff this week, so I'll keep you in the loop on which JIRAs are on my critical path. Note: I'll be on vacation 4/9-21, so there won't be much code progress during that time, but I'll be able to do code reviews (within any given 72-hour window) so you don't get blocked waiting on reviews or merged patches. I hope to get a lot of stuff done before 4/9.
}

union MessageHeader {
  Schema, DictionaryBatch, RecordBatch
Is there a distinction between RecordBatch and RowBatch?
No -- I didn't like the name RecordBatch (Kudu and Impala both use "row batch" to refer to the same concept), but Drill uses "record batch". I'll be sure to add comments clarifying this.
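For anyone mapping the schema to generated code, here is a rough sketch of how a reader might dispatch on this union. The enclosing table (assumed here to be something like `table Message { header: MessageHeader; }`), the namespace-free type names, and the handler functions are illustrative assumptions rather than the actual contents of Message.fbs; the pattern itself (a `_type` discriminant plus an untyped accessor that the caller casts) is flatc's standard treatment of union fields.

```cpp
#include "Message_generated.h"  // hypothetical flatc output for the schema sketched above

// Stub handlers, for illustration only.
static void HandleSchema(const Schema* schema) { (void)schema; }
static void HandleDictionaryBatch(const DictionaryBatch* batch) { (void)batch; }
static void HandleRecordBatch(const RecordBatch* batch) { (void)batch; }

// For an assumed union field `header: MessageHeader`, flatc emits a
// header_type() discriminant plus an untyped header() accessor that the
// caller casts to the concrete table type.
void ProcessMessage(const Message* msg) {
  switch (msg->header_type()) {
    case MessageHeader_Schema:
      HandleSchema(static_cast<const Schema*>(msg->header()));
      break;
    case MessageHeader_DictionaryBatch:
      HandleDictionaryBatch(static_cast<const DictionaryBatch*>(msg->header()));
      break;
    case MessageHeader_RecordBatch:
      HandleRecordBatch(static_cast<const RecordBatch*>(msg->header()));
      break;
    default:  // MessageHeader_NONE
      break;
  }
}
```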
Just rebased and resolved conflicts from #29. Going to address all the feedback now and merge with a green build.
Thank you @emkornfield and others for the detailed reviews. I addressed the comments and added a bunch more comments to Message.fbs. Please continue to scrutinize the metadata serialization and let me or @StevenMPhillips know what is unclear. One thing that is not specified at all is transmitting dictionary-encoded data (especially nested data like strings or other complex structures).
bool Equals(const Array& arr) const;
virtual bool Equals(const std::shared_ptr<Array>& arr) const;
bool EqualsExact(const Array& arr) const;
virtual bool Equals(const std::shared_ptr<Array>& arr) const = 0;
Clang revealed that this (plus the downstream PrimitiveArrayImpl templates) was causing some bad behavior, so we can review in the future how to have the cleanest code for the proliferation of array containers while avoiding unwanted template instantiation (I admit I don't know enough about this to make a judgment).
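For context on the diff above, this is the pattern in a simplified, self-contained form (the class names are illustrative, not the patch's full hierarchy): Equals is a pure virtual on the base class, so each concrete array type supplies its own comparison instead of inheriting a default through the templated containers.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

class Array {
 public:
  virtual ~Array() = default;
  // Pure virtual: every concrete array type must define its own equality.
  virtual bool Equals(const std::shared_ptr<Array>& other) const = 0;
};

class Int32Array : public Array {
 public:
  explicit Int32Array(std::vector<int32_t> values) : values_(std::move(values)) {}

  bool Equals(const std::shared_ptr<Array>& other) const override {
    // Check the runtime type first, then compare element-wise.
    auto typed = std::dynamic_pointer_cast<Int32Array>(other);
    return typed != nullptr && typed->values_ == values_;
  }

 private:
  std::vector<int32_t> values_;
};
```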
+1, thank you all
  List,
  Tuple,
  Union,
  JSONScalar
If this is an alias for Union<Int, ...>, it should probably not be in this type union.
Does FlatBuffers have constant literals?
Several things here:
Quite a bit of judicious code cleaning and consolidation happened as part of this. For example, List types are now internally equivalent to a nested type with one named child field, versus a struct, which can have any number of child fields (a simplified sketch of this distinction follows below).
Associated JIRAs: ARROW-48, ARROW-57, ARROW-58
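As an aside on the List-versus-struct point above, here is a minimal sketch of the distinction (hypothetical, simplified classes, not the patch's actual type objects): a list is a nested type carrying exactly one named child field, while a struct carries any number of them.

```cpp
#include <string>
#include <utility>
#include <vector>

// A named child field; the child's type is just a name here to keep the
// sketch small (the real code would hold a pointer to another type object).
struct Field {
  std::string name;
  std::string type_name;
};

// Base nested type: owns an ordered list of named child fields.
class DataType {
 public:
  explicit DataType(std::vector<Field> children) : children_(std::move(children)) {}
  virtual ~DataType() = default;
  const std::vector<Field>& children() const { return children_; }

 private:
  std::vector<Field> children_;
};

// A list is internally a nested type with exactly one named child field.
class ListType : public DataType {
 public:
  explicit ListType(Field item) : DataType({std::move(item)}) {}
};

// A struct is a nested type with any number of named child fields.
class StructType : public DataType {
 public:
  explicit StructType(std::vector<Field> fields) : DataType(std::move(fields)) {}
};
```

With these definitions, `ListType(Field{"item", "int32"})` always has exactly one child, while `StructType({{"a", "int32"}, {"b", "utf8"}})` has two.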