
Report / blog on parquet metadata sizes for "large" (1000+) numbers of columns #5770

Closed · Tracked by #5853
alamb opened this issue May 16, 2024 · 38 comments

Labels: arrow (Changes to the arrow crate), documentation (Improvements or additions to documentation), enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)

Comments

@alamb (Contributor) commented May 16, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several proposals for remedying perceived issues with parquet, which generally propose new formats, for example Lance V2 and Nimble.

One of the technical challenges raised about Parquet is that the metadata is encoded such that the entire footer must be read and decoded prior to reading any data.

As the number of columns increases, the argument goes, the size of the parquet metadata grows beyond the ~8MB sweet spot for a single object store request, and it requires substantial CPU to decode.

However, my theory is that the reason parquet metadata is typically so large for schemas with many columns is the embedded min/max statistics for columns / pages.

Describe the solution you'd like
I would like to gather data on parquet footer metadata size as a function of:

  1. The number of columns
  2. The number of row groups
  3. Whether statistics are enabled or disabled

And then report this in a blog post, with some sort of conclusion about how well parquet can handle large schemas.

Bonus points if we can also measure in-memory size (though this will of course vary from implementation to implementation).

Describe alternatives you've considered

Additional context
Related discussion with @wesm on twitter: https://twitter.com/wesmckinn/status/1790884370603024826

Cited issue apache/arrow#39676

@alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label May 16, 2024
@alamb (Contributor, Author) commented May 16, 2024

I have hopes that I will find time to work on this next week, but I am not sure.

If anyone knows of prior art in this area (measuring the actual metadata size), it would be great if they could link it here.

@alamb (Contributor, Author) commented May 16, 2024

One idea would be to create a dataset with 1000 columns, 2000 columns, 5000 columns and 10000 columns, with, say, 10 row groups of 1M rows each.

Measure the size of the metadata with

  1. default writer settings
  2. "minimal" settings (no statistics, disable everything possible)

@westonpace (Member) commented May 16, 2024

10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong).

My experience has been either:

  • The files they make are much smaller (and thus they have too few or undersized row groups), e.g. financial data where each ticker is a column.
  • The columns are very sparse (and thus there is a need for better sparse encodings), e.g. feature stores.

@westonpace (Member) commented:

Also, I did do some experimentation here (what feels like a long time ago now but was at least a year ago). Two things I observed:

  • The in-memory representation of the parquet metadata seemed far too large, orders of magnitude greater than the on-disk size of the metadata. I assume this was specific to the C++ impl and there was just something I was missing, but it was pretty significant even with relatively reasonably-sized files (e.g. 50MiB+ per file).
  • The parquet-c++ library (and, to a lesser degree, arrow and arrow datasets) was not written with large numbers of columns in mind. So even if the format itself could work, the implementation ends up being O(N^2) in a lot of places where it could presumably do a lot better.

I realize this second point is kind of what you are trying to show. I'd be interested in measurements of the first point, at least enough to prove I was missing something (or that this problem is unique to parquet-c++).

@mapleFU (Member) commented May 16, 2024

I don't have specific data right now. In my use case, row groups are usually limited by size rather than row count, and we disable statistics for most of these columns. We also only write the page index, without the column index, in this case.

@alamb (Contributor, Author) commented May 16, 2024

In my use case, row groups are usually limited by size rather than row count, and we disable statistics for most of these columns. We also only write the page index, without the column index, in this case.

Yes, this is what I would expect best practice to be for large schemas (especially if the actual data is sparse, as @westonpace says). In general I would expect the value of min/max statistics in parquet metadata to query engines to be quite low for columns where the data isn't sorted or clustered in some way.

Thus, if the default parquet writer settings always write this metadata for all columns, maybe we need to adjust the defaults and documentation.

@tustvold (Contributor) commented:

It may also be worth pointing out that support was recently added to truncate column statistics; page indexes have always supported this, and Spark was actually already doing this for column statistics, with dubious correctness properties. The net result, though, is that even if you do write column statistics for all columns, they need not be very large.
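
For reference, a rough sketch of what those knobs look like on the Rust writer. I am assuming the recently added `set_statistics_truncate_length` builder method here; it may not be available in older parquet crate releases:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn main() {
    // Sketch: keep statistics, but keep them small (assumes a recent parquet crate;
    // `set_statistics_truncate_length` is the newly added truncation support).
    let props = WriterProperties::builder()
        // chunk-level min/max only, skip per-page statistics
        .set_statistics_enabled(EnabledStatistics::Chunk)
        // truncate min/max values stored in ColumnChunk statistics to 64 bytes
        .set_statistics_truncate_length(Some(64))
        // the column/page index has always supported truncation (64 is the default)
        .set_column_index_truncate_length(Some(64))
        .build();
    let _ = props;
}
```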

@tustvold (Contributor) commented:

I've also filed #5775 to track something I suggested on the mailing list, which would likely drastically reduce the overhead of metadata parsing.

@tustvold (Contributor) commented May 16, 2024

So I created a toy benchmark (https://github.com/tustvold/arrow-rs/tree/thrift-bench) of a 10,000 column parquet file, with f32 columns, no statistics and 10 row groups.

Default Config

The file metadata for this comes to a chunky 11MB, and it takes the #5777 thrift decoder around 45ms to parse this thrift payload.

If we drop the column count to 1000, the metadata drops to 1MB and parses in ~3.7ms.

If we also drop the row group count to 1, the metadata drops to 100KB and parses in 460 us.

So at a very rough level the cost is ~110 bytes and ~450ns per column chunk

Drop ColumnMetadata

Dropping the ColumnMetadata from ColumnChunk drops the size down to 3.6MB and the parsing time down to 23ms

Drop ColumnChunk

Dropping the columns drops the size down to 109KB and the parsing time down to 718 us.

At this point we have effectively dropped everything apart from the schema.

Arrow Schema IPC

Now for comparison, I encoded a similar schema using arrow IPC to a flatbuffer. This came to a still pretty chunky 500KB. Validating the offsets in this flatbuffer takes ~1ms, which is longer than decoding the equivalent data from thrift.

Thoughts

  • There isn't anything wrong with using thrift; it is competitive with, if not faster than, flatbuffers (when doing full offset validation)
  • The cost of encoding ColumnChunk dominates and is O(num_row_groups * num_columns)
  • Having a single row group drops the latency down to 2ms even with 10,000 columns. Even if each column only has a single 1MB page, the file will be >10GB even with a single row group

Phrasing the above differently, assuming single page ColumnChunks of 1MB, we can parse the metadata for a column chunk in 450ns. This means for a very large 10GB file we can still parse the metadata within single-digit milliseconds.
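
If anyone wants to reproduce a rough version of these numbers without the benchmark branch, timing the stock reader's footer parse looks something like the sketch below (the file path is a placeholder, and the timing includes reading the footer bytes from disk, not just the thrift decoding):

```rust
use std::fs::File;
use std::time::Instant;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Constructing a SerializedFileReader reads and thrift-decodes the footer metadata
    let start = Instant::now();
    let reader = SerializedFileReader::new(File::open("/tmp/wide.parquet")?)?;
    let elapsed = start.elapsed();

    let metadata = reader.metadata();
    println!(
        "decoded metadata for {} row groups x {} columns in {:?}",
        metadata.num_row_groups(),
        metadata.file_metadata().schema_descr().num_columns(),
        elapsed
    );
    Ok(())
}
```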

@emkornfield (Contributor) commented May 17, 2024

Nice work. I do think #5770 (comment) is probably the key: statistics would make things considerably worse, so adjusting defaults across implementations probably makes sense. I'm not sure what Java is set to, but I think C++ will always write all statistics.

@marcin-krystianc commented:

10K columns by 10 row groups by 1M rows is 100B values (400GB with int32). I don't think anyone has data like that (this is presumptuous, I am probably wrong).

My experience has been either:

  • The files they make are much smaller (and thus they have too few or undersized row groups), e.g. financial data where each ticker is a column.
  • The columns are very sparse (and thus there is a need for better sparse encodings), e.g. feature stores.

Hi, we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only a subset of columns.
That makes the cost of reading the entire metadata footer even higher than the cost of reading the actual data (because we read the entire footer and then read only a tiny subset of the actual data).

To deal with the problem, we've implemented a tool that stores index information in a separate file, which allows reading only the necessary subset of metadata (https://github.com/G-Research/PalletJack).

@tustvold (Contributor) commented:

we use parquet files with 100 row groups and 50k columns (and this is after the dataset has been split into many individual parquet files). What is worse, our use case is reading individual row groups and only a subset of columns.

How big are these files? At 1MB per page (the recommended size) this would come to 5TB?!

For extremely sparse data, such as feature stores, I can't help wondering if MapArray or similar would be a better way to encode this. You would lose the ability to do projection pushdown in the traditional sense, but maybe this is ok?
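
To make the MapArray idea concrete, a sketch of what such a sparse "feature map" column could look like in arrow-rs (the column and feature names are made up, and whether the loss of traditional projection pushdown is acceptable depends on the workload, as noted above):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float32Builder, MapBuilder, StringBuilder};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One "features" map column instead of tens of thousands of mostly-null columns:
    // each row stores only the (name, value) pairs that are actually present.
    let mut builder = MapBuilder::new(None, StringBuilder::new(), Float32Builder::new());

    // Row 0: two features present
    builder.keys().append_value("feature_17");
    builder.values().append_value(0.25);
    builder.keys().append_value("feature_993");
    builder.values().append_value(1.5);
    builder.append(true)?;

    // Row 1: no features present
    builder.append(true)?;

    let features = Arc::new(builder.finish()) as ArrayRef;
    let batch = RecordBatch::try_from_iter([("features", features)])?;
    println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```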

@marcin-krystianc commented:

How big are these files? At 1MB per page (the recommended size) this would come to 5TB?!

I think the file size depends on the number of rows per row group. With 100 row groups, 50k columns and a single row per row group, the file is 700MB (no compression, default page size, no stats).
We use files with tens of thousands of rows per row group, and the file sizes are about 1TB.

@tustvold (Contributor) commented:

the file sizes are about 1TB

Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single-digit GB, with a separate catalog combining multiple files together for querying

@thinkharderdev (Contributor) commented:

the file sizes are about 1TB

Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single-digit GB, with a separate catalog combining multiple files together for querying

1TB does seem to be quite large, but using smaller files requires reading more metadata (with the associated IO). In our system we write files in the range of 50-300MB, and reading metadata was so expensive on large queries (we measured it as taking ~30% of total query processing time in some cases) that we built an entire separate system similar to https://github.com/G-Research/PalletJack to deal with it.

@tustvold (Contributor) commented May 17, 2024

that we built an entire separate system similar

My reading of https://github.com/G-Research/PalletJack is that it is filling a similar role to a catalog like Hive, Delta Lake or Iceberg. It makes sense, at least to me, that applications would want to build additional metadata structures on top of collections of parquet files, which are then optimised for their particular read/write workloads?

@thinkharderdev (Contributor) commented:

that we built an entire separate system similar

My reading of https://github.com/G-Research/PalletJack is that it is filling a similar role to a catalog like Hive, Delta Lake or Iceberg. It makes sense to me that applications would want to build additional metadata structures on top of collections of parquet files that are optimised for their particular read/write workloads, and that by design these would not be a part of the storage format itself?

I'm not sure that's right. A catalog can do many things, and certainly some of those things don't belong in the storage format (e.g. grouping individual parquet files together into some logical group relevant to the query or something), but if the catalog is just directly duplicating data from the parquet footer because reading the footer is too expensive, that seems like something that should be addressed in the storage format itself.

@tustvold (Contributor) commented May 17, 2024

directly duplicating data from the parquet footer because reading the footer is too expensive

But data locality is extremely important? If you have to scan a load of files only to ascertain they're not of interest, that will be wasteful regardless of how optimal the storage format is. Most catalogs collocate file statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then does the reader consult the files that haven't been eliminated and perform more granular pushdown to the row group and page level using the statistics embedded in those files.

Or at least that's the theory...

@thinkharderdev (Contributor) commented:

directly duplicating data from the parquet footer because reading the footer is too expensive

But data locality is extremely important? If you have to scan a load of files only to ascertain they're not of interest, that will be wasteful regardless of how optimal the storage format is. Most catalogs collocate file statistics from across multiple files so that the number of files can be quickly and cheaply whittled down. Only then does the reader consult the files that haven't been eliminated and perform more granular pushdown to the row group and page level using the statistics embedded in those files.

Or at least that's the theory...

Right, but there are two different levels here. You can store statistics that allow you to prune based only on metadata, but what's left you just have to go and scan. And the scanning requires reading the column chunk metadata. So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up.

How much of that cost is related to thrift decoding specifically? I think not much, and the real issue is just IO. But if you could read a file-specific metadata header that allows you to prune IO by reading only the column chunk metadata you actually need (whether it is encoded using thrift or flatbuffers or protobuf or XYZ) then I think that could definitely help a lot.

Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯_(ツ)_/¯

@tustvold (Contributor) commented May 17, 2024

Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯_(ツ)_/¯

This, 1000%. Even with the new SSD-backed stores, latencies are still on the order of milliseconds.

So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up

Yeah, at that point your only real option is to alter the write path to distribute the data in such a way that queries can eliminate files; no metadata chicanery is going to save you if you have to open 100K files 😅

@thinkharderdev (Contributor) commented:

Then again, when dealing with object storage doing a bunch of random access reads can be counter-productive so ¯_(ツ)_/¯

This, 1000%. Even with the new SSD-backed stores, latencies are still on the order of milliseconds.

So if the catalog tells you that you need to scan 100k files with 10k columns each and you are only projecting 5 columns (which pretty well describes what we do for every query), then reading the entire parquet footer for each file to get the 5 column chunks is going to really add up

Yeah, at that point your only real option is to alter the write path to distribute the data in such a way that queries can eliminate files; no metadata chicanery is going to save you if you have to open 100K files 😅

Nimble (the new Meta storage format) has an interesting approach to this. From what I understand, they store most of the data that would go in the footer in parquet inline in the column chunk itself. So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about. But they also don't support predicate pushdown, and so once you add that in you end up with more IOPS to prune data pages (I think)

@mapleFU (Member) commented May 17, 2024

So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about

This sounds like "drop column chunk metadata" in this section? #5770 (comment) This can be enabled also in standard parquet then?

@thinkharderdev (Contributor) commented:

So the footer just needs to describe how the actual column chunks are laid out (along with the schema) and since you need to fetch the column chunk anyway you can avoid reading all the column-chunk specific stuff for columns you don't care about

This sounds like "drop column chunk metadata" in this section? #5770 (comment) This can be enabled also in standard parquet then?

I don't think so because you need the column chunk metadata to actually read the column.

@mapleFU (Member) commented May 17, 2024

I don't think so because you need the column chunk metadata to actually read the column.

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863-L867: ColumnChunk is necessary, but the ColumnMetaData inside it is optional. This might make IO estimation a bit tricky, but it would allow ColumnMetaData to be kept out of the footer? And you're right that we need the column chunk metadata to actually read the column.

@thinkharderdev (Contributor) commented:

I don't think so because you need the column chunk metadata to actually read the column.

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863-L867: ColumnChunk is necessary, but the ColumnMetaData inside it is optional. This might make IO estimation a bit tricky, but it would allow ColumnMetaData to be kept out of the footer? And you're right that we need the column chunk metadata to actually read the column.

Yeah, you're right and I think I just misunderstood the original comment (it gets confusing between the rust types in parquet and the thrift structs defined by the parquet thrift IDL :)). But I guess you could just have the column chunk as the bare minimum of just an offset and path and then just do decoding purely based on the data page headers.

@tustvold (Contributor) commented:

I created a thread on the parquet mailing list to get some other perspectives on this topic, as I think it is fundamental to the expectations of parquet's metadata - https://lists.apache.org/thread/lv5gtfsbqhs7nrxclrq97nt50msxdst2

@marcin-krystianc commented:

the file sizes are about 1TB

Err... Is this a reasonable size for a single parquet file? I'm more accustomed to seeing parquet files on the order of 100MB to single-digit GB, with a separate catalog combining multiple files together for querying

If the size of the entire dataset is tens of TBs, then you need to use larger files or you end up having a lot of files (many thousands), which is hard to manage. Also, thinking about the future of ML, it is clear to me that a requirement for even larger datasets is not unrealistic.

@marcin-krystianc commented:

that we built an entire separate system similar

My reading of https://github.com/G-Research/PalletJack is that it is filling a similar role to a catalog like Hive, Delta Lake or Iceberg. It makes sense, at least to me, that applications would want to build additional metadata structures on top of collections of parquet files, which are then optimised for their particular read/write workloads?

PalletJack operates at a lower level than catalogs like Hive, Delta Lake or Iceberg. It is designed for use with individual parquet files, for the specific use case of "to be able to decode/parse only a minimum amount of parquet metadata to be able to read the requested sample of data from a parquet file".

@tustvold (Contributor) commented May 20, 2024

Catalogs are designed to easily scale to PBs across potentially millions of files; if scalability is the goal, an approach tied to single files will inevitably hit a scaling wall.

To be able to decode/parse only a minimum amount of parquet metadata to be able to read the requested sample of data from a parquet file

This is what catalogs are designed to provide, but they can do so across file boundaries, significantly improving performance and scalability.

@alamb (Contributor, Author) commented May 28, 2024

@XiangpengHao has kindly agreed to do some more analysis and a writeup

@XiangpengHao (Contributor) commented:

I'm happy to experiment with this, please assign me!

@alamb (Contributor, Author) commented May 28, 2024

@XiangpengHao, @tustvold and @wiedld spoke a little about this, and here is my understanding of the plan:

We will run experiments (perhaps based on @tustvold's tool from #5770 (comment))

Specifically, create these scenarios:

  1. Create schemas with 1000 columns, 5k columns, 10k columns, 20k columns (and maybe 100k columns)
  2. Make the schema all floats (to model a machine learning use case)
  3. Write a parquet file with 10M rows, using 1M-row row groups
  4. Try three different writer settings: 1. default (use the Rust writer defaults), 2. minimal metadata (turn off all statistics, etc. to minimize the size of the metadata), 3. maximum statistics (turn on full statistics, including page-level statistics, don't truncate statistics lengths, etc.)

Then, for each scenario, measure:

  1. Size of metadata footer (in bytes)
  2. Time to decode (time to read the ParquetMetaData from the file)

This should result in a table like the following, for each of "metadata size in bytes" and "decode performance":

| Writer Properties  | 1k columns | 2k columns | 5k columns | 10k columns | 20k columns |
|--------------------|------------|------------|------------|-------------|-------------|
| Default Properties | X          | X          | X          | X           | X           |
| Minimum Statistics | X          | X          | X          | X           | X           |
| Maximum Statistics | X          | X          | X          | X           | X           |

Other potential experiments to run:

  • Add string/binary columns to the schema (maybe 10% of the columns) -- I expect to see the metadata be much larger
  • Try other writer settings (e.g. show the effect of metadata truncation)
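
To make the three writer configurations concrete, here is a rough sketch of how they might be expressed with the Rust writer properties. The truncation setters in the "maximum" case are assumptions about the current builder API and may differ by parquet crate release:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

fn main() {
    // 1. default: whatever the Rust writer does out of the box
    let default_props = WriterProperties::builder()
        .set_max_row_group_size(1_000_000) // 1M-row row groups per the plan
        .build();

    // 2. minimal metadata: disable statistics entirely
    let minimal_props = WriterProperties::builder()
        .set_max_row_group_size(1_000_000)
        .set_statistics_enabled(EnabledStatistics::None)
        .build();

    // 3. maximum statistics: page-level statistics and no truncation
    //    (the truncation setters are assumed to accept `None` = unlimited)
    let maximal_props = WriterProperties::builder()
        .set_max_row_group_size(1_000_000)
        .set_statistics_enabled(EnabledStatistics::Page)
        .set_statistics_truncate_length(None)
        .set_column_index_truncate_length(None)
        .build();

    let _ = (default_props, minimal_props, maximal_props);
}
```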

@alamb (Contributor, Author) commented Jun 10, 2024

Is anyone willing to review a draft blog post on this subject? Perhaps @jhorstmann?

If so please email me at [email protected]

@alamb (Contributor, Author) commented Jun 19, 2024

@alamb closed this as completed Jun 19, 2024
@alamb added the documentation (Improvements or additions to documentation) label Jul 2, 2024
@alamb (Contributor, Author) commented Jul 2, 2024

label_issue.py automatically added labels {'documentation'} from #5863

@alamb added the parquet (Changes to the parquet crate) label Jul 2, 2024
@alamb (Contributor, Author) commented Jul 2, 2024

label_issue.py automatically added labels {'parquet'} from #5863

@alamb added the arrow (Changes to the arrow crate) label Jul 2, 2024
@alamb (Contributor, Author) commented Jul 2, 2024

label_issue.py automatically added labels {'arrow'} from #5798

@adriangb (Contributor) commented:

I'll point out that we've had good results by thinking about what sorts of queries we want to run on our data and only storing stats for those columns. It's a trivial optimization, similar to not creating an index for every column in an OLTP database, but I don't see it discussed much, and a lot of high-level writers don't give you that choice.
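
For what it's worth, the Rust writer already exposes this choice via per-column overrides; a minimal sketch (the column names and statistics levels here are illustrative):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};
use parquet::schema::types::ColumnPath;

fn main() {
    // Disable statistics by default, and opt back in only for the columns
    // we actually filter on in queries (names here are made up).
    let props = WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::None)
        .set_column_statistics_enabled(ColumnPath::from("event_time"), EnabledStatistics::Page)
        .set_column_statistics_enabled(ColumnPath::from("tenant_id"), EnabledStatistics::Chunk)
        .build();
    let _ = props;
}
```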
