[C++][Parquet] Slow column reading from multi-column parquet files #38149
Comments
Are these tests running on a local filesystem or an object store?
It is all tested with the local file system.
Sorry for the late reply. Have you solved this problem? As the number of columns grows, the metadata grows too. The metadata is Thrift-encoded, and Thrift needs to deserialize all of the data, so I must admit this is hard to optimize.
Currently I don't know a proper way to solve this; any ideas are welcome. Advice: maybe caching the deserialized footer is OK in this case? @marcin-krystianc
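A minimal sketch of that caching idea in pyarrow, assuming a hypothetical wide file `wide.parquet` (`pq.read_metadata` and the `metadata=` argument of `ParquetFile` are existing pyarrow APIs; file and column names are illustrative):

```python
import pyarrow.parquet as pq

# Parse the (potentially multi-megabyte) footer once.
cached_metadata = pq.read_metadata("wide.parquet")

# Reuse the already-parsed object for subsequent reads, so the
# Thrift deserialization cost is paid only once per file.
pf = pq.ParquetFile("wide.parquet", metadata=cached_metadata)
subset = pf.read(columns=["col_0", "col_1"])
```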
Hi @mapleFU, we haven't found any solution to this problem yet, but we are still looking for one.
In our scenario we use files with many thousands of columns and many row groups. The workaround of caching the metadata is an option, but I think it will not work for everyone.
Would you mind telling us the size of the Parquet footer in your file? Also, we don't need to keep the file open; just caching the parsed metadata should be enough.
In my tests I'm using 20k columns and 10 row groups, which gives about 9-10 megabytes of metadata (exactly 9,386,184 bytes). But in production we use even larger files (although I don't know the exact numbers at the moment).
@marcin-krystianc Logic like ParquetFragment and ParquetDataset might work (see the sketch below).
Also, deserialization introduces many virtual function calls; this could be optimized with some Thrift-level tweaks.
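As I read the suggestion, the idea is something like the following with pyarrow's dataset API: build the fragment once and query it repeatedly (a sketch only; file and column names are illustrative):

```python
import pyarrow.dataset as ds

# Build the dataset and fragment once up front.
dataset = ds.dataset("wide.parquet", format="parquet")
fragment = next(dataset.get_fragments())

# Repeated reads of different column subsets go through the same
# fragment object instead of re-opening the file from scratch.
t1 = fragment.to_table(columns=["col_0"])
t2 = fragment.to_table(columns=["col_1"])
```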
Also, you can try to profile the deserialization; let's find the possible optimizations there. I think the current Thrift deserialization might be low-performance.
That is quite a good workaround, but unfortunately it will not work for us.
It "feels" slow, but there are no obvious culprits:
```
libparquet.so.1500!apache::thrift::transport::TBufferBase::readAll(apache::thrift::transport::TBufferBase * const this, uint8_t * buf, uint32_t len) (\usr\include\thrift\transport\TBufferTransports.h:81)
libparquet.so.1500!apache::thrift::transport::TMemoryBuffer::readAll(apache::thrift::transport::TMemoryBuffer * const this, uint8_t * buf, uint32_t len) (\usr\include\thrift\transport\TBufferTransports.h:696)
libparquet.so.1500!apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>::readByte(apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer> * const this, int8_t & byte) (\usr\include\thrift\protocol\TCompactProtocol.tcc:620)
libparquet.so.1500!apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>::readFieldBegin(apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer> * const this, std::string & name, apache::thrift::protocol::TType & fieldType, int16_t & fieldId) (\usr\include\thrift\protocol\TCompactProtocol.tcc:481)
libparquet.so.1500!apache::thrift::protocol::TVirtualProtocol<apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>, apache::thrift::protocol::TProtocolDefaults>::readFieldBegin_virt(apache::thrift::protocol::TVirtualProtocol<apache::thrift::protocol::TCompactProtocolT<apache::thrift::transport::TMemoryBuffer>, apache::thrift::protocol::TProtocolDefaults> * const this, std::string & name, apache::thrift::protocol::TType & fieldType, int16_t & fieldId) (\usr\include\thrift\protocol\TVirtualProtocol.h:415)
libparquet.so.1500!apache::thrift::protocol::TProtocol::readFieldBegin(apache::thrift::protocol::TProtocol * const this, std::string & name, apache::thrift::protocol::TType & fieldType, int16_t & fieldId) (\usr\include\thrift\protocol\TProtocol.h:423)
libparquet.so.1500!parquet::format::FileMetaData::read(parquet::format::FileMetaData * const this, apache::thrift::protocol::TProtocol * iprot) (\src\arrow\cpp\src\generated\parquet_types.cpp:8011)
libparquet.so.1500!parquet::ThriftDeserializer::DeserializeUnencryptedMessage<parquet::format::FileMetaData>(parquet::ThriftDeserializer * const this, const uint8_t * buf, uint32_t * len, parquet::format::FileMetaData * deserialized_msg) (\src\arrow\cpp\src\parquet\thrift_internal.h:455)
libparquet.so.1500!parquet::ThriftDeserializer::DeserializeMessage<parquet::format::FileMetaData>(parquet::ThriftDeserializer * const this, const uint8_t * buf, uint32_t * len, parquet::format::FileMetaData * deserialized_msg, parquet::Decryptor * decryptor) (\src\arrow\cpp\src\parquet\thrift_internal.h:409)
libparquet.so.1500!parquet::FileMetaData::FileMetaDataImpl::FileMetaDataImpl(parquet::FileMetaData::FileMetaDataImpl * const this, const void * metadata, uint32_t * metadata_len, parquet::ReaderProperties properties, std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) (\src\arrow\cpp\src\parquet\metadata.cc:606)
libparquet.so.1500!parquet::FileMetaData::FileMetaData(parquet::FileMetaData * const this, const void * metadata, uint32_t * metadata_len, const parquet::ReaderProperties & properties, std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) (\src\arrow\cpp\src\parquet\metadata.cc:884)
libparquet.so.1500!parquet::FileMetaData::Make(const void * metadata, uint32_t * metadata_len, const parquet::ReaderProperties & properties, std::shared_ptr<parquet::InternalFileDecryptor> file_decryptor) (\src\arrow\cpp\src\parquet\metadata.cc:871)
libparquet.so.1500!parquet::SerializedFile::ParseUnencryptedFileMetadata(parquet::SerializedFile * const this, const std::shared_ptr<arrow::Buffer> & metadata_buffer, const uint32_t metadata_len) (\src\arrow\cpp\src\parquet\file_reader.cc:626)
libparquet.so.1500!parquet::SerializedFile::ParseMetaData(parquet::SerializedFile * const this) (\src\arrow\cpp\src\parquet\file_reader.cc:444)
libparquet.so.1500!parquet::ParquetFileReader::Contents::Open(std::shared_ptr<arrow::io::RandomAccessFile> source, const parquet::ReaderProperties & props, std::shared_ptr<parquet::FileMetaData> metadata) (\src\arrow\cpp\src\parquet\file_reader.cc:764)
libparquet.so.1500!parquet::ParquetFileReader::Open(std::shared_ptr<arrow::io::RandomAccessFile> source, const parquet::ReaderProperties & props, std::shared_ptr<parquet::FileMetaData> metadata) (\src\arrow\cpp\src\parquet\file_reader.cc:802)
```
Describe the bug, including details regarding any error messages, version, and platform.
Hi,
this is related to #38087 but it covers a different problem.
Similar to the previous issue, in our use case we read a subset of columns (e.g. 100) from a Parquet file containing many more columns (e.g. 20k).
The problem is that the more columns there are in the file, the more time is needed to read a particular column (repro code: https://github.com/marcin-krystianc/arrow_issue_2023-10-06).
In the graph below (produced with https://github.com/marcin-krystianc/arrow_issue_2023-10-06/blob/master/plot_results.py), we can clearly see that when we read 100 columns from a Parquet file (the orange line), the more columns there are in the file, the longer it takes to read a single column.
However, when we read the entire file (all columns), the time to read a single column doesn't depend much on the number of columns in the file. There is still some correlation, but it is much weaker than before.
Both Python and C++ exhibit the same problem, which is no surprise since Python delegates Parquet file reading to C++ anyway.
According to my analysis, there is a simple explanation for the reported problem. When we create a `FileReader`, it reads and parses the entire metadata section of the file. Since the metadata section contains information about all columns, a lot of that metadata reading and parsing is wasted work when we read only a tiny fraction of the columns. Repro code in both Python and C++ is available in the repository linked above.
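For illustration, a minimal sketch of the kind of measurement described above (not the exact script from the linked repository; the file name and column counts are placeholders):

```python
import time
import pyarrow.parquet as pq

# Read a fixed, small subset of columns from a very wide file.
columns = [f"col_{i}" for i in range(100)]

start = time.perf_counter()
table = pq.read_table("wide.parquet", columns=columns)
elapsed = time.perf_counter() - start

# As the total number of columns in the file grows (and with it the
# footer), this time grows too, even though the number of columns
# actually read stays constant.
print(f"read {table.num_columns} columns in {elapsed:.3f} s")
```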
Component(s)
C++, Parquet