[DISCUSSION] Parquet Metadata Improvements #6129

alamb · 2024-07-26T12:59:11Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As we work on various features of Parquet metadata it is becoming clear that working with the current code organization is challenging.

I just wanted to write down some of my thoughts about how it all fits together

Here are some challenges:

The naming is challenging: Consistent naming for Parquet page index structures #6097
There is no easy way to write the metadata to bytes outside the context of a parquet file: Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata #6000
It is complicated to understand how to read optional parts of the metadata that are not inlined (e.g. OffsetIndexes): Document when the ParquetRecordBatchReader will re-read metadata #5887
If we ever wanted to speed up metadata parsing (e.g. Use custom thrift decoder to improve speed of parsing parquet metadata #5854), it would be hard with the current structure
There is not always a 1-1 correspondence between file::metadata and the thrift structures in format::metadata

Describe the solution you'd like
I would like to propose

We continue to clarify the distinction between file::metadata and format::metadata
Improve the API to translate back and forth between them and bytes and de-emphasize the conversion between thrift structures

Maybe this is clear to others but it is not to me

Here is how I see the structures involved:

                                ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐               ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    
                                  ┌──────────────┐                         ┌───────────────────────┐ │   
                                │ │ ColumnIndex  │        │               ││    ParquetMetaData    │     
                                  └──────────────┘                         └───────────────────────┘ │   
  ┌──────────────┐              │ ┌────────────────┐      │               │┌───────────────────────┐     
  │   ..0x24..   │  ◀────────▶    │  OffsetIndex   │          ◀────────▶   │    ParquetMetaData    │ │   
  └──────────────┘              │ └────────────────┘      │               │└───────────────────────┘     
                                           ...                                       ...             │   
                                │ ┌──────────────────┐    │               │ ┌──────────────────┐         
bytes                             │  FileMetaData*   │                      │  FileMetaData*   │     │   
(thrift encoded)                │ └──────────────────┘    │               │ └──────────────────┘         
                                 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─                 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   
                                                                                                         
                                     format::meta structures               file::metadata structures         
                                                                                                         
                                                                                                         
                                                     * Same name, different struct

I would like to focus on improving the API for going back/forth between bytes and the file::metadata structures

                                                  ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─    
                                                   ┌───────────────────────┐ │   
┌──────────────┐                                  ││    ParquetMetaData    │     
│   ..0x24..   │           ◀────────▶              └───────────────────────┘ │   
└──────────────┘                                  │┌───────────────────────┐     
                                                   │    ParquetMetaData    │ │   
                        Would like to focus       │└───────────────────────┘     
 bytes                  on this API to/from                                  │   
 (thrift encoded)       bytes and the             │ ┌──────────────────┐         
                        file::metadata              │  FileMetaData*   │     │   
                                                  │ └──────────────────┘         
                                                   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘   
                                                                                 
                                                   file::metadata structures

Describe alternatives you've considered
I think we probably need at least two different APIs:

Reading

One that reads from a [u8] buffered in memory (decode_footer and decode_metadata)
One that reads from an AsyncReader or something equivalent (MetadataLoader is enough / needs some more information)
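To make the in-memory reading case concrete, here is a minimal sketch of the first step: finding the thrift-encoded metadata inside a byte buffer. It relies only on the Parquet file layout, where a file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1. The names locate_metadata and FooterError are hypothetical and are not the parquet crate's API; the sketch also ignores encrypted footers.

```rust
use std::ops::Range;

/// Size of the fixed trailer at the end of a Parquet file:
/// a 4-byte little-endian metadata length followed by the 4-byte magic.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical error type for this sketch (not from the parquet crate).
#[derive(Debug, PartialEq)]
enum FooterError {
    TooShort,
    BadMagic,
    LengthOutOfRange,
}

/// Hypothetical helper: returns the byte range of the thrift-encoded
/// metadata within `file`, without decoding it.
fn locate_metadata(file: &[u8]) -> Result<Range<usize>, FooterError> {
    if file.len() < FOOTER_SIZE {
        return Err(FooterError::TooShort);
    }
    let footer_start = file.len() - FOOTER_SIZE;
    let footer = &file[footer_start..];

    // The last four bytes must be the magic marker.
    if footer[4..] != PARQUET_MAGIC[..] {
        return Err(FooterError::BadMagic);
    }

    // The four bytes before the magic hold the metadata length.
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]) as usize;
    if len > footer_start {
        return Err(FooterError::LengthOutOfRange);
    }

    // The metadata sits immediately before the 8-byte trailer.
    Ok(footer_start - len..footer_start)
}
```

The returned range is what a thrift decoder would then be given. An async reader needs the same two steps (fetch the trailer, then fetch the metadata range), which is one reason the two reading APIs could share this logic.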

Writing

Writes to [u8] (API for encoding/decoding ParquetMetadata with more control #6002)
Writes to an AsyncWriter perhaps

Additional context


alamb · 2024-08-02T20:29:24Z

Here is a down payment for documentation #6184

jp0317 · 2024-08-09T22:52:20Z

Thanks @alamb! IMHO the current file::metadata also lacks some features that might be helpful (and are available in the C++ implementation). For instance: getting a complete Thrift-serialized representation of the FileMetaData, finding the index of a column given its dot-string path, and finding the parent of a field.

EDITED: I just saw #6197 which feels relevant to the 1st one. For the last one, there's a Type struct in the code which seems similar to the C++ Node. I'm currently not sure how complex it would be, or whether it is worth the effort, to support a field tree with parent info in the current code. But a simple way might be to maintain a Vec<Option<TypePtr>> that marks the parent of each Type, while adding an index to each Type?
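A rough sketch of that parent-vector idea, under some assumptions: Node is a simplified stand-in for the crate's Type, and the parent is stored as an index (Vec<Option<usize>>) instead of a TypePtr so the example stays self-contained. All names here (Node, FlatSchema, path) are hypothetical.

```rust
/// Simplified stand-in for a schema node; NOT the parquet crate's `Type`.
#[derive(Debug)]
struct Node {
    name: String,
    children: Vec<Node>,
}

/// Convenience constructor used only by this sketch.
fn node(name: &str, children: Vec<Node>) -> Node {
    Node { name: name.to_string(), children }
}

/// Flattened view of a schema tree: `names[i]` and `parents[i]`
/// describe the node that was assigned index `i`.
#[derive(Debug, Default)]
struct FlatSchema {
    names: Vec<String>,
    parents: Vec<Option<usize>>,
}

impl FlatSchema {
    /// Assigns indices in pre-order (a node before its children).
    fn from_root(root: &Node) -> Self {
        let mut flat = FlatSchema::default();
        flat.visit(root, None);
        flat
    }

    fn visit(&mut self, node: &Node, parent: Option<usize>) {
        let index = self.names.len();
        self.names.push(node.name.clone());
        self.parents.push(parent);
        for child in &node.children {
            self.visit(child, Some(index));
        }
    }

    /// Dot-separated path to node `index`, excluding the root's name.
    fn path(&self, mut index: usize) -> String {
        let mut parts = Vec::new();
        // Walk up through the parents until the root (which has none).
        while let Some(parent) = self.parents[index] {
            parts.push(self.names[index].as_str());
            index = parent;
        }
        parts.reverse();
        parts.join(".")
    }
}
```

With a root named schema that has children a (itself with children b and c) and d, the nodes get indices 0 to 4 in that order, the parent of c is index 1, and path(3) yields a.c. The cost is one extra Vec alongside the tree; the tree itself is left unchanged.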

alamb · 2024-08-13T21:48:47Z

> EDITED: I just saw #6197 which feels relevant to the 1st one.

I agree. We are also looking for help with the reading portion -- see comments on #6002 cc @adriangb

> finding the index given the column dot-string path,

There is something similar here https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_column.html but adding a real API that handles the field resolution logic for nested fields would be very nice. Perhaps you can file a ticket requesting this feature (I have found clearly worded tickets are very often picked up by people in this community)
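For illustration only, a minimal sketch of what such a lookup could look like, assuming the leaf columns are already available in file order as lists of path parts. ColumnPathIndex and column_index are hypothetical names, not an existing API in the parquet crate.

```rust
use std::collections::HashMap;

/// Hypothetical lookup table from a dot-string path (e.g. "user.name")
/// to the index of the corresponding leaf column.
struct ColumnPathIndex {
    by_path: HashMap<String, usize>,
}

impl ColumnPathIndex {
    /// `leaf_paths[i]` holds the path parts of leaf column `i`,
    /// in the same order as the columns appear in the file schema.
    fn new(leaf_paths: &[Vec<&str>]) -> Self {
        let by_path = leaf_paths
            .iter()
            .enumerate()
            .map(|(index, parts)| (parts.join("."), index))
            .collect();
        Self { by_path }
    }

    /// Returns the leaf column index for `dot_path`, or `None` if the path
    /// does not name a leaf column (for example, if it names a group).
    fn column_index(&self, dot_path: &str) -> Option<usize> {
        self.by_path.get(dot_path).copied()
    }
}
```

One caveat with this approach: a field name that itself contains a dot makes the joined string ambiguous, so a real API would probably need to accept the path parts as well as the joined form.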

> For the last one, there's a Type struct in the codes which seems similar to the C++ Node. I'm currently am not sure how complex or whether it worths the effort to support a field-tree with parent info in the current codes? But a simple way might be maintaining a Vec<Option<TypePtr>> that marks the parent of each Type, while adding an index to each Type?

I am not familiar with the use case for finding the parent of a field, so I don't have much to add to this.

alamb added the enhancement and parquet labels Jul 26, 2024

alamb mentioned this issue Jul 26, 2024

Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata #6000

Closed

etseidl mentioned this issue Jul 26, 2024

Use LevelHistogram throughout Parquet metadata #6134

Closed

alamb mentioned this issue Aug 2, 2024

Add (more) Parquet Metadata Documentation #6184

Merged

etseidl mentioned this issue Oct 7, 2024

Consider adding BloomFilter reading support to ParquetMetadataReader #6514

Open
