[Python] ds.dataset cannot filter on hive-style partitions created with org.apache.spark.version: '3.4.1' #37802
Comments
It seems that the issue is not with the pyspark version, but with something else. :(
It is hard to debug issues without a reproducible example. Does the filtering in polars cause the issue, or does reading the dataset in pyarrow? That is, if you load the dataset only with pyarrow, without using polars, do you see the same problem? You can also inspect the schema of the two datasets created with the different versions of Apache Spark, see https://arrow.apache.org/docs/python/dataset.html#dataset-discovery. Maybe you will be able to find the difference?
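For reference, a minimal sketch of that schema comparison using pyarrow's dataset discovery; the paths below are hypothetical placeholders for the datasets written by the two Spark versions:

```python
import pyarrow.dataset as ds

# Hypothetical paths; point these at the datasets written by Spark 3.4.0 and 3.4.1.
ds_spark_340 = ds.dataset("/data/table_spark_3_4_0", format="parquet", partitioning="hive")
ds_spark_341 = ds.dataset("/data/table_spark_3_4_1", format="parquet", partitioning="hive")

# Differences (e.g. a partition column inferred with a different type)
# would show up when comparing the two printed schemas.
print(ds_spark_340.schema)
print(ds_spark_341.schema)
```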
Oh, you mentioned in a later comment that the pyspark version is not the issue. What exactly is the issue then? Do you run out of memory in any case (no matter which version of pyspark you are using)?
Yes. The parquet dataset is 8.5 GB, and upon reading it I run out of 32 GB of RAM.
The bug seems to be in the to_polars() / to_pandas() step (both result in running out of memory and end with a Python traceback). The M999912 partition is less than 6 GB on disk (in parquet).
If you leave out the to_polars() call, do you still run out of memory? Could you show the schema of the dataset?
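A sketch of how both could be checked without any pandas/polars conversion; the path and partition column name below are placeholders:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("/data/my_table", format="parquet", partitioning="hive")

# Print the discovered schema, including the hive partition columns.
print(dataset.schema)

# Count the rows matching the partition filter without materializing
# the data in pandas or polars.
print(dataset.count_rows(filter=ds.field("partition_col") == "M999912"))
```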
Leaving out the to_polars() and just calling .head() shows the same behaviour: it runs out of memory. The number of records should be 426694231 in that partition.
Can you try reading only a limited number of records, e.g. the first 100000?
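For instance, something along these lines reads just a fixed-size slice of the partition (path and column names are again hypothetical):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("/data/my_table", format="parquet", partitioning="hive")

# Pull only the first 100000 matching rows instead of the whole partition.
sample = dataset.head(100_000, filter=ds.field("partition_col") == "M999912")
print(sample.num_rows, sample.nbytes)
```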
Reading the first 100000 records works fine.
Extrapolating that to the full size of the file would give around 27 GB in memory. Are you sure you have enough memory? You mentioned 32 GB, but there might also be other programs running that require memory.
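That extrapolation is just a linear scale-up of the sample's in-memory footprint to the reported row count; a rough sketch of the arithmetic (paths and column names are hypothetical):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("/data/my_table", format="parquet", partitioning="hive")
sample = dataset.head(100_000, filter=ds.field("partition_col") == "M999912")

TOTAL_ROWS = 426_694_231  # row count reported above for the M999912 partition
estimated_gib = sample.nbytes / sample.num_rows * TOTAL_ROWS / 1024**3
print(f"~{estimated_gib:.1f} GiB estimated for the full partition in memory")
# 27 GB over ~427 million rows works out to roughly 63 bytes per row.
```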
Could be.
Closing this, because it is not really reproducible and it is not clear what was causing the issue (polars or pyarrow).
Describe the bug, including details regarding any error messages, version, and platform.
Not sure if this belongs to polars or pyarrow.
If I run the offending code (see the sketch below) on a hive-partitioned parquet file created with org.apache.spark.version: '3.4.0', it runs fine.
If I run it on a file (having 8 simple columns) created with org.apache.spark.version: '3.4.1', it runs out of the 32 GB of memory.
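The original code is not preserved in this copy of the issue; the following is only a minimal sketch of the pattern being described, with hypothetical path and column names, and pl.from_arrow standing in for the to_polars() conversion mentioned in the comments:

```python
import pyarrow.dataset as ds
import polars as pl

# Hypothetical path and partition column; the original snippet is not shown here.
dataset = ds.dataset("/data/my_table", format="parquet", partitioning="hive")

# Filter on a hive partition column and materialize the matching rows.
table = dataset.to_table(filter=ds.field("partition_col") == "M999912")

# Converting to polars (or pandas) is where the memory blow-up was observed.
df = pl.from_arrow(table)   # or: table.to_pandas()
```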
Component(s)
Parquet, Python