[C++][Parquet] Fast Random Rowgroup Reads #39676
Thousands of rowgroups is an anti-pattern for laying out data (I understand some customers do it, and sometimes this is out of our control), and it creates exactly this type of performance bottleneck, so we should audit the write config parameters to make sure there isn't something causing this type of spilling. And yes, in general Parquet is not well suited to very large column widths. I think there is a better solution here, but given that this touches metadata serialization I'm not sure what the appetite will be for trying to incorporate metadata that parses faster. In any case, format changes need to be discussed on the Parquet mailing list [email protected]
So your bottleneck is reading the metadata and row groups? (Since the statistics would be huge.) First, I think we should look at whether we can reduce the row-group metadata on the writer side; that would be much easier. Actually I think the idea is great, since decoding the metadata can be heavy, but the "decode_first_row_group" approach is quite tricky here. If we can do it better (like decoding only specific metadata), I think it would be great.
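(For reference, a writer-side mitigation along those lines might look roughly like the sketch below. It is only a sketch assuming the Arrow C++ `parquet::arrow::WriteTable` API and the `WriterProperties` builder; the row-group size is an arbitrary illustrative value, not a recommendation.)

```cpp
#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>
#include <parquet/properties.h>

#include <memory>
#include <string>

// Write `table` with fewer, larger row groups and without per-chunk
// statistics, so FileMetaData contains far fewer RowGroup/ColumnChunk entries.
arrow::Status WriteWithSmallFooter(const std::shared_ptr<arrow::Table>& table,
                                   const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));

  std::shared_ptr<parquet::WriterProperties> props =
      parquet::WriterProperties::Builder()
          .max_row_group_length(10 * 1000 * 1000)  // rows per row group
          ->disable_statistics()                   // shrink per-column metadata
          ->build();

  // chunk_size also caps rows per row group; keep it large so a multi-billion
  // row table does not end up with thousands of row groups.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/10 * 1000 * 1000, props);
}
```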
One potential way of doing this could be to reduce the current parquet.thrift to just the metadata needed. I believe in that case it should generate code that will skip over unknown fields (I would need to double-check whether specific settings are required to make this happen).
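(To illustrate the skipping idea, here is a minimal, hypothetical hand-rolled reader over the Thrift compact protocol; it is not the Arrow implementation. The field ids follow `FileMetaData` in parquet.thrift, and note that `skip()` still has to walk the bytes of the skipped fields, it just avoids materializing them.)

```cpp
#include <thrift/protocol/TCompactProtocol.h>
#include <thrift/transport/TBufferTransports.h>

#include <cstdint>
#include <memory>
#include <string>

// Decode only FileMetaData.version (field id 1) and num_rows (field id 3)
// from a serialized footer, skipping everything else (schema, row_groups,
// statistics, ...). Skipping still walks the bytes, but avoids materializing
// thousands of RowGroup/ColumnChunk objects.
void ReadPartialFileMetaData(const uint8_t* footer, uint32_t len,
                             int32_t* version, int64_t* num_rows) {
  using apache::thrift::protocol::TCompactProtocol;
  using apache::thrift::protocol::TType;
  using apache::thrift::protocol::T_STOP;
  using apache::thrift::transport::TMemoryBuffer;

  auto buffer =
      std::make_shared<TMemoryBuffer>(const_cast<uint8_t*>(footer), len);
  TCompactProtocol proto(buffer);

  std::string name;
  TType ftype;
  int16_t fid;
  proto.readStructBegin(name);
  while (true) {
    proto.readFieldBegin(name, ftype, fid);
    if (ftype == T_STOP) break;
    switch (fid) {
      case 1: proto.readI32(*version); break;   // FileMetaData.version
      case 3: proto.readI64(*num_rows); break;  // FileMetaData.num_rows
      default: proto.skip(ftype);               // skip unwanted fields
    }
    proto.readFieldEnd();
  }
  proto.readStructEnd();
}
```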
@emkornfield @mapleFU So, back to the high-level design for faster reads, there are two parts:
To your other comments: yes, the giant metadata header is the big bottleneck for large tables. Once there are a lot of rows and columns it can take 100x longer to parse that big header than to read the data (if we are just taking a sample). I think this is because the header gets huge, and parsing the Thrift data is actually quite slow since it needs to be decoded field by field and then recopied. It is probably hard to get the format changed, so my first thought was to only read the metadata for the first rowgroup and use it as a kind of prototype. There may be better approaches, although I'm not sure what they would be.
Somewhat related, this article gives a nice overview of where this optimization fits in with ideas like predicate push-down, fast queries, etc.
I understand why you don't read all the row-group metadata, but why is a "first RowGroup" read in this experiment? We already have the schema here: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1116
@corwinjoy I think we should likely address a few issues here before proceeding to an implementation:
@emkornfield @mapleFU
Points from the profiling session:
@emkornfield wrote:
see above
I'm not sure how much we can reduce this without changing the parquet spec. My main argument is that reading all the rowgroups (and some of the other metadata) is simply unnecessary to retrieve the data.
The PR listed here is fine as an interface, but it suffers from the same problem as the benchmarks presented here: opening the file still has to read the full metadata before accessing rowgroups, and that can be super expensive. The kind of optimization proposed here would provide internals to avoid reading the full metadata while still being able to access rowgroup data.
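(For context, the sketch below is roughly the current path through the parquet-cpp public API being referred to; the file path and indices are illustrative. The expensive step is that `Open()` must deserialize the whole `FileMetaData` footer before any row group can be touched.)

```cpp
#include <arrow/io/file.h>
#include <parquet/column_reader.h>
#include <parquet/file_reader.h>

#include <memory>
#include <string>

// Even to read a single row group, Open() deserializes the entire thrift
// FileMetaData (every RowGroup and ColumnChunk entry) before RowGroup(i)
// can be called -- that footer parse is the cost being discussed.
void ReadOneRowGroup(const std::string& path, int row_group, int column) {
  auto infile = arrow::io::ReadableFile::Open(path).ValueOrDie();
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(infile);

  std::shared_ptr<parquet::RowGroupReader> rg = reader->RowGroup(row_group);
  std::shared_ptr<parquet::ColumnReader> col = rg->Column(column);
  // ... decode values from `col` ...
}
```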
@mapleFU wrote:
I like this idea and I think it has the potential to be even faster than what I have done in the PR. To be specific:
So the second item read is actually this schema, which doesn't even show up in the profile, so I think it may be quite fast. As to why I am using RowGroups: the data readers seem to be intimately tied to the RowGroup metadata.
This would certainly benefit our use case, which is a dataset consisting of many thousands of columns and a few billion rows (with row groups of a few GB in size). When reading a specific row group, the time to read the metadata can be a significant fraction of the total read time.
@corwinjoy I get what you'd like to do; I'll go through this patch this week. The main issue is that hacking the thrift-generated data structure is a bit hacky 🤔, especially when we only want the "first row group"... Besides, could you share a file similar to the one in your test case (with mocked data), so I can try to reproduce the problem? At the least, could I know the file size, row count, column count, and row-group count?
@mapleFU Thanks for taking a look!
In both cases, I think the function would need to return after reading the first row group since we can't safely skip bytes. In terms of providing a test file, the new unit tests in
To be consistent with the other tests, I am using the test data directory, so you will need to set the test data environment variable, e.g.
Related: #41761
Thanks @pitrou
Describe the enhancement requested
Background:
For parquet files that have a large number of rowgroups and columns, reading the full file metadata is prohibitively expensive when you just want a sample from a table. (Our customers are using parquet files via Arrow that contain > 10k columns and thousands of rowgroups.) For the case where you just want to read a few rowgroups and/or columns, we would like to have a fast random access reader.
Idea:
Read only the minimal metadata from the parquet file needed to establish the columns and column types. Require that the file contain an OffsetIndex section, and use the offset index to directly access the required data pages and columns. Preliminary work indicates that this can give a 2x or 3x speedup even with a modest number of columns and rowgroups under the existing parquet format. With some minor parquet format changes, I believe this could be 100x faster.
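(As a rough illustration only, the sketch below uses the existing page-index API in parquet-cpp to list the OffsetIndex page locations that such a reader would seek to; class and method names assume a recent Arrow release. Note that it still parses the full footer in `Open()`, which is exactly the cost this proposal aims to avoid.)

```cpp
#include <arrow/io/file.h>
#include <parquet/file_reader.h>
#include <parquet/page_index.h>

#include <iostream>
#include <memory>
#include <string>

// Print the OffsetIndex entries for one column chunk: each PageLocation gives
// the file offset, compressed size, and first row index of a data page, which
// is enough to seek straight to the pages covering a requested row range.
void LocatePages(const std::string& path, int row_group, int column) {
  auto infile = arrow::io::ReadableFile::Open(path).ValueOrDie();
  auto reader = parquet::ParquetFileReader::Open(infile);  // parses full footer

  auto page_index_reader = reader->GetPageIndexReader();
  auto rg_index = page_index_reader->RowGroup(row_group);
  if (rg_index == nullptr) return;  // file was written without a page index
  auto offset_index = rg_index->GetOffsetIndex(column);
  if (offset_index == nullptr) return;

  for (const auto& loc : offset_index->page_locations()) {
    std::cout << "page @ offset " << loc.offset << ", compressed size "
              << loc.compressed_page_size << ", first row "
              << loc.first_row_index << "\n";
  }
}
```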
Related Work:
There has been some similar work done in this direction, but I think it is more at the interface level than direct performance tuning:
[C++][Parquet] Support read by row ranges #39392
[C++][Parquet] support passing a RowRange to RecordBatchReader #38865
Jira: Selective reading of rows for parquet file
And a previous discussion around this with additional benchmarks:
#38149
Having a fast random access reader would also be beneficial for fast reading of a file with predicate pushdowns or other applications where specific rows and columns are desired.
Component(s)
C++