
Add struct support to parquet writer #7461

Merged · 105 commits merged into rapidsai:branch-0.19 on Mar 19, 2021

Conversation

@devavret (Contributor) commented Feb 26, 2021

Adds struct writing ability to parquet writer.

The internals of the writer have been changed in the following way:

  • Previously, we constructed parquet_column_view objects from the cudf columns and the input options, and used them to construct the schema. Now we construct the schema directly from the input cudf columns and the input options.
  • The constructed schema is used to generate views of the cudf columns that have a single-child hierarchy. E.g. one struct<int, float> column is converted into two columns: struct<int> and struct<float>. Each of these columns results in a separate parquet_column_view, which is used only for encoding (see the sketch after this list).
  • To give the user finer control over per-column options, the old metadata class is replaced by table_input_metadata.
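
For illustration, a hedged sketch using cudf's test column wrappers (the names and values here are made up, not from this PR) of a column the writer now decomposes:

#include <cudf_test/column_wrapper.hpp>

// A struct<int, float> column built from two leaf columns.
cudf::test::fixed_width_column_wrapper<int32_t> ints{1, 2, 3};
cudf::test::fixed_width_column_wrapper<float> floats{1.f, 2.f, 3.f};
cudf::test::structs_column_wrapper struct_col{{ints, floats}};

// For encoding, the writer views this as two single-leaf hierarchies,
// struct<int> and struct<float>, each backing its own parquet_column_view.
// The generated parquet schema still contains a single struct group with
// both leaves as children.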

Breaking change: Input metadata

The new input metadata class table_input_metadata contains a vector of column_in_metadata, one per top-level column; each of these in turn contains a vector of child column_in_metadata, one for each child of the input column. It can be constructed from the input table, and specific options can then be changed at each level.

For a table with a single struct column:

Struct<is_human:bool (non-nullable),
       Struct<weight:float,
              age:int
             > (nullable)
      > (non-nullable)

We can set the per-level names and, optionally, the nullability as follows:

cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("being").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("human?").set_nullability(false);
metadata.column_metadata[0].child(1).set_name("particulars");
metadata.column_metadata[0].child(1).child(0).set_name("weight");
metadata.column_metadata[0].child(1).child(1).set_name("age");
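
Continuing the example, a hedged sketch of passing this metadata to the writer (the file name is a placeholder; the builder-style API mirrors the tests in this PR):

cudf::io::parquet_writer_options opts =
  cudf::io::parquet_writer_options::builder(
    cudf::io::sink_info{"example.parquet"}, table)
    .metadata(&metadata)
    .build();
cudf::io::write_parquet(opts);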

Related issues

Closes #6989
Closes #6816
Strangely, there isn't an issue asking for struct writing support.

kaatish and others added 30 commits, starting January 7, 2021, including:

  • Avoid early exit for empty table
  • Readable from pyarrow but not from cudf, because cudf needs unique child names in order to have a unique path in schema
  • Parent column device view is now always set in EncColDesc
  • Need new method to distinguish list type columns
@vuule (Contributor) left a comment:
This is the last of it. I've done it! 🎉
Great stuff :) Got a batch of minor suggestions, nothing serious.

[Resolved review threads on cpp/src/io/parquet/writer_impl.cu and cpp/tests/io/parquet_test.cpp]
@nvdbaranec (Contributor) left a comment:
Would just like to see some more comments on the -2 and -3 magic numbers. Looks good otherwise.

@devavret (Contributor, Author) replied:

> Would just like to see some more comments on the -2 and -3 magic numbers. Looks good otherwise.

Do you mean in the code? Or as a reply to the review?

@nvdbaranec (Contributor) replied:

In the code.

@vuule (Contributor) left a comment:

🔥 🔥

@galipremsagar (Contributor) left a comment:
Minor comment, else LGTM

[Resolved review thread on python/cudf/cudf/_lib/parquet.pyx]
@vuule added the "5 - Ready to Merge" label (testing and reviews complete, ready to merge) on Mar 18, 2021
@@ -215,20 +215,25 @@ struct ColumnChunkDesc {
 /**
  * @brief Struct describing an encoder column
  */
-struct EncColumnDesc : stats_column_desc {
+struct parquet_column_device_view : stats_column_desc {
   uint32_t *dict_index;    //!< Dictionary index [row]
   uint32_t *dict_data;     //!< Dictionary data (unique row indices)
   uint8_t physical_type;   //!< physical data type
   uint8_t converted_type;  //!< logical data type
   // TODO (dm): Evaluate if this is sufficient. At 4 bits, this allows a maximum 16 level nesting
A contributor commented on this diff:
Should we re-evaluate this now? Is it really worth the bit operations and the 16 nesting depth limit to smash this together into level_bits?

@devavret (Author) replied:

Just saw this after merging. But the comment is wrong: 16 is not the nesting depth but the bit size of the nesting depth, which aligns with the parquet Apache reference implementation. Our limit is slightly lower, at 8 bits (max 255 level nesting), because the algorithm would perform really badly with very deep nesting.

I agree this is an over-optimization. I'll remove it the next time I touch this code.

@devavret (Author) commented:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit bd29a92 into rapidsai:branch-0.19 Mar 19, 2021
@hyperbolic2346 (Contributor) left a comment:

This looks great and is a nice step in the correct direction. Thanks for all this hard work, Devavret.

Comment on lines +91 to 93
for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
children.push_back(std::make_shared<linked_column_view>(this, *child_it));
}
A contributor commented:
Suggested change

- for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
-   children.push_back(std::make_shared<linked_column_view>(this, *child_it));
- }
+ std::transform(col.child_begin(), col.child_end(), std::back_inserter(children),
+                [this](column_view const &child) { return std::make_shared<linked_column_view>(this, child); });

Comment on lines +99 to 101
for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
children.push_back(std::make_shared<linked_column_view>(this, *child_it));
}
A contributor commented:
Is there a reason not to use std::transform() here?

Comment on lines +117 to +119
for (column_view const &col : table) {
result.emplace_back(std::make_shared<linked_column_view>(col));
}
A contributor commented:
std::transform?

@devavret (Author) replied:
While I agree about the other cases, personally I find range-based for loops cleaner and easier to understand. They also help when debugging, to check which element caused the breakage.

A contributor replied:
Yeah, it's hard to justify transform when a range loop is as clean as this one. I'm not sure what the guideline should be.
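
For reference, a hedged sketch of the std::transform form under discussion (requires <algorithm> and <iterator>; not code from the PR):

std::transform(table.begin(), table.end(), std::back_inserter(result),
               [](cudf::column_view const &col) {
                 return std::make_shared<linked_column_view>(col);
               });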

// Construct single inheritance column_view from linked_column_view
auto curr_col = schema_node.leaf_column.get();
column_view single_inheritance_cudf_col = *curr_col;
while (curr_col->parent) {
A contributor commented:
We marched down to build the schema and we march back up to write it? Why not store the top-most parent for the leaf node so we don't have to march it again?

@devavret (Author) replied:
We are not writing the schema here. parquet_column_view's ctor is only going to read the schema and get information that's useful while encoding the data.

The top-most parent still contains all its children, so from it alone we wouldn't be able to get to this particular leaf. Here, we use the leaf to march up and convert the hierarchy to the single-child format. So the parent is like this:

   S
  / \
 i   f
 ^   ^--- Some other column's leaf
 └─── this column's leaf

By having a pointer to this column's leaf we can create

S
|
i
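
For context, a simplified sketch of that upward march (member names follow the snippet quoted above; an illustration, not the exact PR code). Each iteration rebuilds the parent's view with the accumulated view as its only child, yielding a hierarchy with one child per level:

auto curr_col = schema_node.leaf_column.get();
cudf::column_view single_inheritance_cudf_col = *curr_col;
while (curr_col->parent) {
  auto const &parent = *curr_col->parent;
  single_inheritance_cudf_col = cudf::column_view(parent.type(),
                                                  parent.size(),
                                                  nullptr,  // struct parents hold no data
                                                  parent.null_mask(),
                                                  cudf::UNKNOWN_NULL_COUNT,
                                                  parent.offset(),
                                                  {single_inheritance_cudf_col});
  curr_col = curr_col->parent;
}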

}
if (curr_schema_node.repetition_type == parquet::REPEATED) { ++max_rep_level; }
curr_schema_node = schema_tree[curr_schema_node.parent_idx];
}
A contributor commented:
Why wouldn't a schema node contain this information, so we don't have to march the schema tree multiple times?

@devavret (Author) replied:
By the time we get to this ctor, the schema has already been generated using the input table and metadata. This ctor only reads the generated schema to get info needed for writing.

Comment on lines +632 to +637
while (curr_schema_node.parent_idx != -1) {
if (not curr_schema_node.is_stub()) {
r_nullability.push_back(curr_schema_node.repetition_type == FieldRepetitionType::OPTIONAL);
}
curr_schema_node = schema_tree[curr_schema_node.parent_idx];
}
A contributor commented:
This can be merged with the loop above if it isn't rolled into the schema data.

}
return nbits;
};
desc.level_bits = count_bits(max_rep_level()) << 4 | count_bits(max_def_level());
A contributor commented:
Should there be a CUDF_EXPECTS here to verify these values will fit?

@devavret (Author) replied:
They definitely will. There is a CUDF_EXPECTS in the ctor that checks that the max rep/def level is < 256, which means each level value fits in 8 bits. And the number 8 definitely fits into one nibble.
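
To make the fit concrete, a hedged illustration (count_bits as defined in the snippet above):

uint8_t const rep_width = count_bits(max_rep_level());  // <= count_bits(255) == 8
uint8_t const def_width = count_bits(max_def_level());  // <= 8 likewise
uint8_t const level_bits = rep_width << 4 | def_width;  // e.g. 0x88 at the limit; both nibbles fit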

// Mass allocation of column_device_views for each parquet_column_view
std::vector<column_view> cudf_cols;
cudf_cols.reserve(parquet_columns.size());
for (auto const &parq_col : parquet_columns) { cudf_cols.push_back(parq_col.cudf_column_view()); }
A contributor commented:
transform?

Labels: 5 - Ready to Merge · breaking · cuIO · feature request · libcudf · Python