Add struct support to parquet writer #7461
Conversation
Avoid early exit for empty table
Readable from pyarrow but not from cudf because cudf needs unique child names in order to have a unique path in schema
Parent column device view is now always set in EncColDesc. Need new method to distinguish list type columns.
This is the last of it. I've done it! 🎉
Great stuff :) Got a batch of minor suggestions, nothing serious.
Would just like to see some more comments on the -2 and -3 magic numbers. Looks good otherwise.
Do you mean in the code? Or as a reply to the review?
In the code.
🔥 🔥
Minor comment, else LGTM
```diff
@@ -215,20 +215,25 @@ struct ColumnChunkDesc {
 /**
  * @brief Struct describing an encoder column
  */
-struct EncColumnDesc : stats_column_desc {
+struct parquet_column_device_view : stats_column_desc {
   uint32_t *dict_index;   //!< Dictionary index [row]
   uint32_t *dict_data;    //!< Dictionary data (unique row indices)
   uint8_t physical_type;  //!< physical data type
   uint8_t converted_type; //!< logical data type
   // TODO (dm): Evaluate if this is sufficient. At 4 bits, this allows a maximum 16 level nesting
```
Should we re-evaluate this now? Is it really worth the bit operations and the 16 nesting depth limit to smash this together into `level_bits`?
Just saw this after merging. But the comment is wrong: 16 is not the nesting depth but the bit size of the nesting depth, which aligns with the Apache Parquet reference implementation. Our limit is slightly lower, at 8 bits (max 255 levels of nesting), because the algorithm would perform very badly at deep nesting.
I agree this is an over-optimization. I'll remove it the next time I touch this code.
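For context, a minimal sketch of the nibble packing under debate; the function names are illustrative, not from the PR:

```cpp
#include <cstdint>

// Pack the bit widths of the repetition and definition levels into one byte:
// high nibble = repetition-level bits, low nibble = definition-level bits.
// Each width is at most 8 (levels < 256), so it always fits in 4 bits.
uint8_t pack_level_bits(uint8_t rep_bits, uint8_t def_bits)
{
  return static_cast<uint8_t>((rep_bits << 4) | def_bits);
}

uint8_t unpack_rep_bits(uint8_t level_bits) { return level_bits >> 4; }
uint8_t unpack_def_bits(uint8_t level_bits) { return level_bits & 0xF; }
```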
@gpucibot merge
This looks great and is a nice step in the correct direction. Thanks for all this hard work, Devavret.
```cpp
for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
  children.push_back(std::make_shared<linked_column_view>(this, *child_it));
}
```
Suggested change:

```cpp
std::transform(col.child_begin(), col.child_end(), std::back_inserter(children),
               // std::transform passes the dereferenced element, so take the
               // child directly rather than dereferencing an iterator
               [this](auto const &child) { return std::make_shared<linked_column_view>(this, child); });
```
```cpp
for (auto child_it = col.child_begin(); child_it < col.child_end(); ++child_it) {
  children.push_back(std::make_shared<linked_column_view>(this, *child_it));
}
```
Is there a reason not to use `std::transform()` here?
```cpp
for (column_view const &col : table) {
  result.emplace_back(std::make_shared<linked_column_view>(col));
}
```
std::transform?
While I agree about the other cases, I've personally found range-based for loops to be cleaner and easier to understand. They also help when debugging, to check which element caused the breakage.
Yeah, it's hard to justify `transform` when a range loop is as clean as this one. I'm not sure what the guideline should be.
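As an aside, a self-contained sketch of the two forms being compared; `child_t` is a made-up stand-in for the real column types, not anything from this PR:

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <vector>

struct child_t {};  // hypothetical stand-in for a child column

int main()
{
  std::vector<child_t> const children_in(3);
  std::vector<std::shared_ptr<child_t>> children;

  // Range-based for loop: easy to step through in a debugger.
  for (auto const &child : children_in) {
    children.push_back(std::make_shared<child_t>(child));
  }

  // std::transform: the same population expressed as an algorithm.
  children.clear();
  std::transform(children_in.begin(), children_in.end(), std::back_inserter(children),
                 [](auto const &child) { return std::make_shared<child_t>(child); });
}
```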
```cpp
// Construct single inheritance column_view from linked_column_view
auto curr_col = schema_node.leaf_column.get();
column_view single_inheritance_cudf_col = *curr_col;
while (curr_col->parent) {
```
We marched down to build the schema and we march back up to write it? Why not store the top-most parent for the leaf node so we don't have to march it again?
We are not writing the schema here. `parquet_column_view`'s ctor is only going to read the schema and get information that's useful while encoding the data.

The top-most parent still contains all its children, so from it alone we wouldn't be able to get to this particular leaf. Here, we're using the leaf to march up and convert it to the single-child format. So the parent looks like this:

```
    S
   / \
  i   f
  ^   ^--- some other column's leaf
  └── this column's leaf
```

By having a pointer to this column's leaf we can create:

```
S
|
i
```
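For illustration, a hedged sketch of that march; `linked_col` is a hypothetical stand-in for the PR's `linked_column_view`, and the `column_view` constructor arguments follow cudf's public signature:

```cpp
#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

// Hypothetical node: each linked column knows its parent.
struct linked_col {
  linked_col const *parent;  // nullptr at the root
  cudf::column_view col;
};

// March from this column's leaf up to the root, wrapping the accumulated
// view as the sole child of each ancestor: i becomes S{i}, and so on.
cudf::column_view make_single_inheritance(linked_col const &leaf)
{
  cudf::column_view chain = leaf.col;
  for (auto const *p = leaf.parent; p != nullptr; p = p->parent) {
    chain = cudf::column_view(p->col.type(), p->col.size(), nullptr,
                              p->col.null_mask(), cudf::UNKNOWN_NULL_COUNT,
                              p->col.offset(), {chain});
  }
  return chain;
}
```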
```cpp
  }
  if (curr_schema_node.repetition_type == parquet::REPEATED) { ++max_rep_level; }
  curr_schema_node = schema_tree[curr_schema_node.parent_idx];
}
```
Why wouldn't a schema node contain this information, so we don't have to march the schema tree multiple times?
By the time we get to this ctor, the schema has already been generated from the input table and metadata. This ctor only reads the generated schema to get info for writing.
```cpp
while (curr_schema_node.parent_idx != -1) {
  if (not curr_schema_node.is_stub()) {
    r_nullability.push_back(curr_schema_node.repetition_type == FieldRepetitionType::OPTIONAL);
  }
  curr_schema_node = schema_tree[curr_schema_node.parent_idx];
}
```
This can be merged with the loop above if it isn't rolled into the schema data.
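A sketch of what that merge could look like, pulling the two quoted loops into one pass; the `schema_node` type here is a simplified stand-in for the PR's schema tree nodes:

```cpp
#include <vector>

enum class FieldRepetitionType { REQUIRED, OPTIONAL, REPEATED };

struct schema_node {
  int parent_idx;                         // -1 at the root
  FieldRepetitionType repetition_type;
  bool is_stub() const { return false; }  // placeholder for the PR's stub check
};

// One march up the schema tree that both counts repetition levels and
// records per-level nullability, merging the two loops quoted above.
void walk_to_root(std::vector<schema_node> const &schema_tree, int leaf_idx,
                  int &max_rep_level, std::vector<bool> &r_nullability)
{
  auto curr = schema_tree[leaf_idx];
  while (curr.parent_idx != -1) {
    if (curr.repetition_type == FieldRepetitionType::REPEATED) { ++max_rep_level; }
    if (not curr.is_stub()) {
      r_nullability.push_back(curr.repetition_type == FieldRepetitionType::OPTIONAL);
    }
    curr = schema_tree[curr.parent_idx];
  }
}
```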
```cpp
  }
  return nbits;
};
desc.level_bits = count_bits(max_rep_level()) << 4 | count_bits(max_def_level());
```
Should there be a CUDF_EXPECTS here to verify these values will fit?
They definitely will. There is a `CUDF_EXPECTS` in the ctor that checks that the max rep/def level is < 256. That means it fits in 8 bits, and the number 8 definitely fits into one nibble.
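To make the arithmetic concrete, an illustrative `count_bits` matching the snippet above:

```cpp
#include <cstdint>

// Number of bits needed to represent values in [0, max_level].
uint8_t count_bits(uint16_t max_level)
{
  uint8_t nbits = 0;
  while (max_level > 0) {
    ++nbits;
    max_level >>= 1;
  }
  return nbits;  // max_level < 256 gives nbits <= 8, which fits in a nibble
}
```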
```cpp
// Mass allocation of column_device_views for each parquet_column_view
std::vector<column_view> cudf_cols;
cudf_cols.reserve(parquet_columns.size());
for (auto const &parq_col : parquet_columns) { cudf_cols.push_back(parq_col.cudf_column_view()); }
```
transform?
Adds struct writing ability to parquet writer.

The internals of the writer have been changed in the following way:

- Previously, we constructed a `parquet_column_view` from the cudf columns and the input options and used it to construct the schema. Now we construct the schema directly from the input cudf columns and the input options.
- Each nested input column is split per leaf, e.g. a `struct<int, float>` column is converted into two columns: `struct<int>` and `struct<float>`. Each of these columns results in a separate `parquet_column_view`, which is used only for encoding.
- Input metadata is now passed in the new `table_input_metadata` (see the breaking change below).
**Breaking change: Input metadata**

The new input metadata class `table_input_metadata` contains a vector of `column_in_metadata`, each of which in turn contains a vector of `column_in_metadata`, one for each child of the input column. It can be constructed using the input table, and then specific options can be changed for each level.

For a table with a single struct column, we can set the per-level names and optional nullability as sketched below:
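The code block that originally followed didn't survive extraction; here is a hedged reconstruction assuming an API shaped like the description above (`column_metadata`, `set_name`, `set_nullability`, and `child` are assumed names, and the field names are made up):

```cpp
// Assumed usage sketch: per-level metadata for a table whose first column
// is a struct with two children. 'table' is the input table_view.
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0].set_name("struct_col").set_nullability(false);
metadata.column_metadata[0].child(0).set_name("int_field");
metadata.column_metadata[0].child(1).set_name("float_field").set_nullability(true);
```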
Related issues
Closes #6989
Closes #6816
Strangely, there isn't an issue asking for struct writing support.