
Add struct support to parquet writer #7461

Merged · 105 commits merged into rapidsai:branch-0.19 on Mar 19, 2021

Conversation

@devavret (Contributor) commented Feb 26, 2021

Adds struct writing ability to parquet writer.

The internals of the writer have been changed in the following way:

  • Previously, we constructed parquet_column_view objects from the cudf columns and the input options, and used them to construct the schema. Now we construct the schema directly from the input cudf columns and the input options.
  • The constructed schema is used to generate views of the cudf columns that have a single-child hierarchy. E.g. one struct<int, float> column is converted into two columns: struct<int> and struct<float>. Each of these columns results in a separate parquet_column_view, which is used only for encoding (see the sketch after this list).
  • To give the user finer control over per-column options, the old metadata class is replaced by table_input_metadata.
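
For illustration, a hedged sketch using cudf's test column wrappers (the names and values here are made up, not from this PR) of a column the writer now decomposes:

#include <cudf_test/column_wrapper.hpp>

// A struct<int, float> column built from two leaf columns.
cudf::test::fixed_width_column_wrapper<int32_t> ints{1, 2, 3};
cudf::test::fixed_width_column_wrapper<float> floats{1.f, 2.f, 3.f};
cudf::test::structs_column_wrapper struct_col{{ints, floats}};

// For encoding, the writer views this as two single-leaf hierarchies,
// struct<int> and struct<float>, each backing its own parquet_column_view.
// The generated parquet schema still contains a single struct group with
// both leaves as children.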

Breaking change: Input metadata

The new input metadata class table_input_metadata contains a vector of column_in_metadata, one per top-level column; each of these in turn contains a vector of child column_in_metadata, one for each child of the input column. It can be constructed from the input table, and specific options can then be changed at each level.

For a table with a single struct column:

Struct<is_human:bool (non-nullable),
       Struct<weight:float,
              age:int
             > (nullable)
      > (non-nullable)

We can set the per-level names and, optionally, the nullability as follows:

cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
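
Continuing the example, a hedged sketch of passing this metadata to the writer (the file name is a placeholder; the builder-style API mirrors the tests in this PR):

cudf::io::parquet_writer_options opts =
  cudf::io::parquet_writer_options::builder(
    cudf::io::sink_info{"example.parquet"}, table)
    .metadata(&metadata)
    .build();
cudf::io::write_parquet(opts);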

Related issues

Closes #6989
Closes #6816
Strangely, there isn't an issue asking for struct writing support.

kaatish and others added 30 commits, starting January 7, 2021, including:

  • Avoid early exit for empty table
  • Readable from pyarrow but not from cudf, because cudf needs unique child names in order to have a unique path in schema
  • Parent column device view is now always set in EncColDesc
  • Need new method to distinguish list type columns
@vuule (Contributor) left a comment:
This is the last of it. I've done it! 🎉
Great stuff :) Got a batch of minor suggestions, nothing serious.

[Resolved review threads on cpp/src/io/parquet/writer_impl.cu and cpp/tests/io/parquet_test.cpp]
@nvdbaranec (Contributor) left a comment:
Would just like to see some more comments on the -2 and -3 magic numbers. Looks good otherwise.

@devavret (Contributor, Author) replied:

> Would just like to see some more comments on the -2 and -3 magic numbers. Looks good otherwise.

Do you mean in the code? Or as a reply to the review?

@nvdbaranec (Contributor) replied:

In the code.

@vuule (Contributor) left a comment:

🔥 🔥

@galipremsagar (Contributor) left a comment:
Minor comment, else LGTM

[Resolved review thread on python/cudf/cudf/_lib/parquet.pyx]
@vuule added the "5 - Ready to Merge" label (testing and reviews complete, ready to merge) on Mar 18, 2021
@@ -215,20 +215,25 @@ struct ColumnChunkDesc {
 /**
  * @brief Struct describing an encoder column
  */
-struct EncColumnDesc : stats_column_desc {
+struct parquet_column_device_view : stats_column_desc {
   uint32_t *dict_index;    //!< Dictionary index [row]
   uint32_t *dict_data;     //!< Dictionary data (unique row indices)
   uint8_t physical_type;   //!< physical data type
   uint8_t converted_type;  //!< logical data type
   // TODO (dm): Evaluate if this is sufficient. At 4 bits, this allows a maximum 16 level nesting
A contributor commented on this diff:
Should we re-evaluate this now? Is it really worth the bit operations and the 16 nesting depth limit to smash this together into level_bits?

@devavret (Author) replied:

Just saw this after merging. But the comment is wrong: 16 is not the nesting depth but the bit size of the nesting depth, which aligns with the parquet Apache reference implementation. Our limit is slightly lower, at 8 bits (max 255 level nesting), because the algorithm would perform really badly with very deep nesting.

I agree this is an over-optimization. I'll remove it the next time I touch this code.

@devavret (Author) commented:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit bd29a92 into rapidsai:branch-0.19 Mar 19, 2021
@hyperbolic2346 (Contributor) left a comment:

This looks great and is a nice step in the correct direction. Thanks for all this hard work, Devavret.

Comment on lines +91 to 93
for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
children.push_back(std::make_shared<linked_column_view>(this, *child_it));
}
A contributor commented:
Suggested change

- for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
-   children.push_back(std::make_shared<linked_column_view>(this, *child_it));
- }
+ std::transform(col.child_begin(), col.child_end(), std::back_inserter(children),
+                [this](column_view const &child) { return std::make_shared<linked_column_view>(this, child); });

Comment on lines +99 to 101
for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
children.push_back(std::make_shared<linked_column_view>(this, *child_it));
}
A contributor commented:
Is there a reason not to use std::transform() here?

Comment on lines +117 to +119
for (column_view const &col : table) {
result.emplace_back(std::make_shared<linked_column_view>(col));
}
A contributor commented:
std::transform?

@devavret (Author) replied:
While I agree about the other cases, personally I find range-based for loops cleaner and easier to understand. They also help when debugging, to check which element caused the breakage.

A contributor replied:
Yeah, it's hard to justify transform when a range loop is as clean as this one. I'm not sure what the guideline should be.
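
For reference, a hedged sketch of the std::transform form under discussion (requires <algorithm> and <iterator>; not code from the PR):

std::transform(table.begin(), table.end(), std::back_inserter(result),
               [](cudf::column_view const &col) {
                 return std::make_shared<linked_column_view>(col);
               });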

// Construct single inheritance column_view from linked_column_view
auto curr_col = schema_node.leaf_column.get();
column_view single_inheritance_cudf_col = *curr_col;
while (curr_col->parent) {
A contributor commented:
We marched down to build the schema and we march back up to write it? Why not store the top-most parent for the leaf node so we don't have to march it again?

@devavret (Author) replied:
We are not writing the schema here. parquet_column_view's ctor is only going to read the schema and get information that's useful while encoding the data.

The top-most parent still contains all its children, so from it alone we wouldn't be able to get to this particular leaf. Here, we use the leaf to march up and convert the hierarchy to the single-child format. So the parent is like this:

   S
  / \
 i   f
 ^   ^--- Some other column's leaf
 └─── this column's leaf

By having a pointer to this column's leaf we can create

S
|
i
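
For context, a simplified sketch of that upward march (member names follow the snippet quoted above; an illustration, not the exact PR code). Each iteration rebuilds the parent's view with the accumulated view as its only child, yielding a hierarchy with one child per level:

auto curr_col = schema_node.leaf_column.get();
cudf::column_view single_inheritance_cudf_col = *curr_col;
while (curr_col->parent) {
  auto const &parent = *curr_col->parent;
  single_inheritance_cudf_col = cudf::column_view(parent.type(),
                                                  parent.size(),
                                                  nullptr,  // struct parents hold no data
                                                  parent.null_mask(),
                                                  cudf::UNKNOWN_NULL_COUNT,
                                                  parent.offset(),
                                                  {single_inheritance_cudf_col});
  curr_col = curr_col->parent;
}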

}
if (curr_schema_node.repetition_type == parquet::REPEATED) { ++max_rep_level; }
curr_schema_node = schema_tree[curr_schema_node.parent_idx];
}
A contributor commented:
Why wouldn't a schema node contain this information, so we don't have to march the schema tree multiple times?

@devavret (Author) replied:
By the time we get to this ctor, the schema has already been generated using the input table and metadata. This ctor only reads the generated schema to get info needed for writing.

Comment on lines +632 to +637
while (curr_schema_node.parent_idx != -1) {
if (not curr_schema_node.is_stub()) {
r_nullability.push_back(curr_schema_node.repetition_type == FieldRepetitionType::OPTIONAL);
}
curr_schema_node = schema_tree[curr_schema_node.parent_idx];
}
A contributor commented:
This can be merged with the loop above if it isn't rolled into the schema data.

}
return nbits;
};
desc.level_bits = count_bits(max_rep_level()) << 4 | count_bits(max_def_level());
A contributor commented:
Should there be a CUDF_EXPECTS here to verify these values will fit?

@devavret (Author) replied:
They definitely will. There is a CUDF_EXPECTS in the ctor that checks that the max rep/def level is < 256, which means each level value fits in 8 bits. And the number 8 definitely fits into one nibble.
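
To make the fit concrete, a hedged illustration (count_bits as defined in the snippet above):

uint8_t const rep_width = count_bits(max_rep_level());  // <= count_bits(255) == 8
uint8_t const def_width = count_bits(max_def_level());  // <= 8 likewise
uint8_t const level_bits = rep_width << 4 | def_width;  // e.g. 0x88 at the limit; both nibbles fit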

// Mass allocation of column_device_views for each parquet_column_view
std::vector<column_view> cudf_cols;
cudf_cols.reserve(parquet_columns.size());
for (auto const &parq_col : parquet_columns) { cudf_cols.push_back(parq_col.cudf_column_view()); }
A contributor commented:
transform?

Labels: 5 - Ready to Merge · breaking · cuIO · feature request · libcudf · Python