
[Python] ds.dataset cannot filter on hive-style partitions created with org.apache.spark.version: '3.4.1' #37802

Closed
lmocsi opened this issue Sep 20, 2023 · 12 comments

Comments

@lmocsi

lmocsi commented Sep 20, 2023

Describe the bug, including details regarding any error messages, version, and platform.

import polars as pl
import pyarrow.dataset as ds
df = pl.scan_pyarrow_dataset(ds.dataset(parq_path+filename, partitioning='hive'))
df.filter(pl.col('partition_column') == 'value').head(5).collect()

Not sure if this belongs to polars or pyarrow.
If I run the above code on a hive-partitioned parquet file created with org.apache.spark.version: '3.4.0', it runs fine.
If I run it on a file (having 8 simple columns) created with org.apache.spark.version: '3.4.1', it runs out of 32 GB of memory.

Component(s)

Parquet, Python

@lmocsi
Author

lmocsi commented Sep 21, 2023

It seems that the issue is not with the pyspark version but with something else. :(
(It must be some issue with the data itself.)

@AlenkaF
Member

AlenkaF commented Sep 26, 2023

It is hard to debug issues without a reproducible example.

Is the issue caused by the filtering in polars, or by reading the dataset in pyarrow? That is, if you load the dataset with pyarrow alone, without polars (ds.dataset(parq_path+filename, partitioning='hive')), do you also run into the memory issue?

You can also inspect the schema of the two different datasets created with different versions of Apache Spark, see https://arrow.apache.org/docs/python/dataset.html#dataset-discovery. Maybe you will be able to find the difference?
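A minimal sketch of such a comparison (the paths are placeholders for the two datasets written with the different Spark versions) could look like:

import pyarrow.dataset as ds

# Placeholder paths for the two datasets written with different Spark versions
ds_spark_340 = ds.dataset("data/written_with_spark_3_4_0", partitioning="hive")
ds_spark_341 = ds.dataset("data/written_with_spark_3_4_1", partitioning="hive")

print(ds_spark_340.schema)
print(ds_spark_341.schema)

# equals() compares field names and types; check_metadata=True also compares
# the schema-level metadata that Spark writes (e.g. org.apache.spark.version)
print(ds_spark_340.schema.equals(ds_spark_341.schema, check_metadata=True))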

@AlenkaF
Member

AlenkaF commented Sep 26, 2023

> You can also inspect the schema of the two different datasets created with different versions of Apache Spark, see https://arrow.apache.org/docs/python/dataset.html#dataset-discovery. Maybe you will be able to find the difference?

Oh, you mentioned in a later comment that the pyspark version is not the issue. What exactly is the issue then? Do you run out of memory in every case, no matter which version of pyspark you are using?

@lmocsi
Author

lmocsi commented Sep 27, 2023

Yes. The parquet dataset is 8.5 GB on disk, and reading it exhausts my 32 GB of RAM,
even though I'm filtering on the partition key. :(
I'm using polars 0.19.3 to read the data. The hive-partitioned dataset was created with pyspark 3.4.1.

@lmocsi
Author

lmocsi commented Sep 27, 2023

The bug seems to be in the to_polars() / to_pandas() conversion (both result in running out of memory):

Traceback (most recent call last):
  File "/tmp/1000920000/ipykernel_14253/1035655516.py", line 1, in <module>
    dataset.to_table(filter=ds.field('MONTH_CODE') == 'M999912').to_polars().head()
  File "pyarrow/_dataset.pyx", line 556, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3638, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 117, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 201326592 failed

The M999912 partition is less than 6 GB on disk (in parquet).

import pyarrow.dataset as ds
dataset = ds.dataset(path_to_dir, format='parquet', partitioning='hive')
dataset.to_table(filter=ds.field('MONTH_CODE') == 'M999912').to_polars().head()

@jorisvandenbossche
Member

If you leave out the to_polars().head() (only run the to_table()) in the example above, do you then have the same issue? The to_table method already reads everything into memory, so it would be strange if that did not already fail while the conversion to polars (which should be mostly zero-copy) does.

Could you show the schema of the dataset?
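As a minimal sketch of that isolation step (reusing path_to_dir from the snippet above), something like:

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(path_to_dir, format='parquet', partitioning='hive')

# Materialize the filtered partition with pyarrow only, no polars conversion
table = dataset.to_table(filter=ds.field('MONTH_CODE') == 'M999912')

# Size of the materialized table and total bytes allocated by pyarrow
print(table.nbytes)
print(pa.total_allocated_bytes())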

@lmocsi
Author

lmocsi commented Sep 29, 2023

Leaving out the to_polars().head() shows the same behaviour: runs out of memory.
Schema is this:
EFFECTIVE_START_DATE: timestamp[ns]
EFFECTIVE_END_DATE: timestamp[ns]
VALID_FROM: timestamp[ns]
VALID_TO: timestamp[ns]
AB_PART_PARTY_ID: int64
AD_STTY_ID: int64
VALUE_DICT_ID: int64
MONTH_CODE: string
-- schema metadata --
org.apache.spark.timeZone: 'Europe/Budapest'
org.apache.spark.legacyINT96: ''
org.apache.spark.version: '3.4.0'
org.apache.spark.sql.parquet.row.metadata: '{"type":"struct","fields":[{"' + 569

Number of records should be 426694231 in that partition.

@jorisvandenbossche
Member

Can you do table = dataset.head(100_000, filter=ds.field('MONTH_CODE') == 'M999912') to see if reading the first X rows works OK? And if that works OK, what's the size of this part of the data in memory? (table.nbytes)
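Spelled out, that check is roughly:

table = dataset.head(100_000, filter=ds.field('MONTH_CODE') == 'M999912')
print(table.num_rows)   # should be 100000 if that many rows match the filter
print(table.nbytes)     # in-memory size of this sample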

@lmocsi
Author

lmocsi commented Sep 29, 2023

Reading the first 100000 records works fine.
table.nbytes returns 6762500

@jorisvandenbossche
Member

Extrapolating that to the full size of the file gives around 27 GB in memory. Are you sure you have enough memory? You mentioned 32 GB, but other programs you have running might also need memory.
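The arithmetic behind that estimate, as a quick sketch:

sample_bytes = 6_762_500                      # table.nbytes for the 100,000-row sample
bytes_per_row = sample_bytes / 100_000        # ~67.6 bytes per row
total_rows = 426_694_231                      # reported record count of the partition
print(bytes_per_row * total_rows / 2**30)     # ~26.9 GiB, i.e. roughly 27 GB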

@lmocsi
Author

lmocsi commented Sep 30, 2023

That could be.
But isn't the head() call pushed down, so that not all of the data is read in to_polars().head()?
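One way to read only a handful of rows on the pyarrow side, without materializing the whole filtered partition, is to stream record batches from a scanner and stop early. A sketch (this says nothing about what polars actually pushes down; path_to_dir is reused from above):

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(path_to_dir, format='parquet', partitioning='hive')
scanner = dataset.scanner(filter=ds.field('MONTH_CODE') == 'M999912')

batches = []
rows_seen = 0
for batch in scanner.to_batches():    # streams batches instead of materializing all rows
    batches.append(batch)
    rows_seen += batch.num_rows
    if rows_seen >= 5:
        break

sample = pa.Table.from_batches(batches).slice(0, 5)   # only the first 5 matching rows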

@lmocsi
Author

lmocsi commented Feb 21, 2024

Closing this, because it is not really reproducible and I'm not sure what was causing the issue (polars or pyarrow).

@lmocsi lmocsi closed this as completed Feb 21, 2024
@kou kou changed the title ds.dataset cannot filter on hive-style partitions created with org.apache.spark.version: '3.4.1' [Python] ds.dataset cannot filter on hive-style partitions created with org.apache.spark.version: '3.4.1' Feb 22, 2024