Optimise parquet read parameters #3928

gaffer01 · 2024-12-13T13:44:36Z

Background

Experimentation has suggested better defaults for the parameters used when running queries, specifically the use of column indexes and the S3A readahead range. These optimisations reduce both the number of GETs and the time to return results. The reduction in the number of GETs is dramatic for tables with many columns.

Description

When running queries we want to turn off the use of column indexes. We also want to set the readahead range to the size of the row group.

We can also turn off the use of column indexes when reading parquet files in Java compactions (this has already been done in the DataFusion-based compaction code).

We want to continue to write column indexes when we write parquet files, as external programs may want to read Sleeper's parquet files and use them.

Analysis

To turn off column indexes when running queries we can use the useColumnIndexFilter(false) option on new ParquetRecordReader.Builder(path, schema). We can have a table option to determine whether column indexes are used when performing queries. This should default to false.
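A minimal sketch of how such a table option could be resolved, defaulting to false when it is not set. The property name `example.table.query.column.index.enabled` and the class name are hypothetical, chosen for illustration only, and are not Sleeper's actual names.

```java
import java.util.Map;

public class ColumnIndexOption {
    // Hypothetical table property name, used for illustration only.
    static final String QUERY_COLUMN_INDEX_ENABLED = "example.table.query.column.index.enabled";

    // Returns whether queries should use column indexes; false when the property is unset.
    static boolean useColumnIndexForQueries(Map<String, String> tableProperties) {
        return Boolean.parseBoolean(
                tableProperties.getOrDefault(QUERY_COLUMN_INDEX_ENABLED, "false"));
    }

    public static void main(String[] args) {
        // No property set: column indexes are off by default.
        System.out.println(useColumnIndexForQueries(Map.of()));
        // Property explicitly enabled for a table.
        System.out.println(useColumnIndexForQueries(Map.of(QUERY_COLUMN_INDEX_ENABLED, "true")));
    }
}
```

The resolved boolean would then be passed to useColumnIndexFilter(...) when building the reader for a query.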

We want to default the readahead range used for queries to the row group size. We already have a table property for the readahead range, so I suggest we simply give it the same default value as the row group size.
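As a rough sketch of that suggestion, the readahead default can share one constant with the row group size default. The property name and the 8 MiB value below are assumptions for illustration, not Sleeper's actual property name or default. `fs.s3a.readahead.range` is the Hadoop S3A setting the value would end up in. Note that this simple form does not follow a table's overridden row group size; it only shares the default.

```java
import java.util.Map;

public class ReadaheadDefault {
    // Hypothetical property name and default size, for illustration only.
    static final String READAHEAD_RANGE = "example.table.query.readahead.range";
    static final long DEFAULT_ROW_GROUP_SIZE_BYTES = 8L * 1024 * 1024;
    // The readahead default is defined as the row group size default.
    static final long DEFAULT_READAHEAD_RANGE_BYTES = DEFAULT_ROW_GROUP_SIZE_BYTES;

    // Returns the configured readahead range, or the shared default when unset.
    static long readaheadRange(Map<String, String> tableProperties) {
        String explicit = tableProperties.get(READAHEAD_RANGE);
        return explicit != null ? Long.parseLong(explicit) : DEFAULT_READAHEAD_RANGE_BYTES;
    }

    // Builds the S3A setting that would be applied to the Hadoop configuration for a query.
    static Map<String, String> s3aSettings(Map<String, String> tableProperties) {
        return Map.of("fs.s3a.readahead.range", Long.toString(readaheadRange(tableProperties)));
    }

    public static void main(String[] args) {
        System.out.println(s3aSettings(Map.of()));
        System.out.println(s3aSettings(Map.of(READAHEAD_RANGE, "65536")));
    }
}
```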

We can explicitly set the use of column indexes to false when performing Java compactions. There seems to be no need to have this as an option.

gaffer01 added the enhancement label Dec 13, 2024

gaffer01 added this to the 0.28.0 milestone Dec 13, 2024

rtjd6554 self-assigned this Jan 9, 2025
