
Broken filter for newly created delta table #2169

Closed
Hanspagh opened this issue Feb 6, 2024 · 15 comments · Fixed by #2172
Labels
bug Something isn't working

Comments

@Hanspagh

Hanspagh commented Feb 6, 2024

Environment

Delta-rs version:
'0.15.2'

Binding:
python

Environment:

  • OS: Mac

Bug

When creating a new delta table from a pandas dataframe, the filter predicate appears to be broken for some expressions.

What happened:
.to_pandas() and .to_pyarrow_dataset() return 0 rows

What you expected to happen:
The above functions should return the rows matching the filter predicate

How to reproduce it:
This is a large dataset that I cannot share, but please point me in any direction for how to debug this.

This is how I achieved my current results

from pyarrow import dataset as ds
from pyarrow.parquet import ParquetDataset
import pyarrow.compute as pc
from deltalake import write_deltalake, DeltaTable
import pandas as pd

df = "some_internal_data"

write_deltalake("broken_nlrot.delta", df)

dt_broken = DeltaTable("broken_nlrot.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("terminal") == "NLROTTM")).shape
# returns (0, 19)

# Trying to read directly with pyarrow
ds.dataset("broken_nlrot.delta").to_table(filter=(pc.field("terminal") == "NLROTTM")).shape
# returns (773536, 19) which is the expected result

# similarly, to_pandas also does not seem to work
dt_broken.to_pandas(filters=[("terminal", "==", "NLROTTM")]).shape 
# returns (0,19)

# again with pyarrow we are fine
ParquetDataset("broken_nlrot.delta", filters=[("terminal", "==", "NLROTTM")]).read().shape
# returns (773536, 19) which is the expected result

# oddly enough, with a different filter predicate I don't get zero rows, but the results are still wrong
dt_broken.to_pandas(filters=[("terminal", "==", "USSAVGC")]).shape
# (89406, 19) - Wrong

ds.dataset("broken_nlrot.delta").to_table(filter=(pc.field("terminal") == "USSAVGC")).shape
# (420029, 19) - Correct

# pulling the full data in pandas also seems to work
full = dt_broken.to_pandas()
full[full["terminal"] == "NLROTTM"].shape
# (773536, 19)

full[full["terminal"] == "USSAVGC"].shape
# (420029, 19)


# calling optimize seems to fix the problem, but this should not be needed in order for filters to work
dt_broken.optimize.z_order(["terminal"])
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("terminal") == "NLROTTM")).shape
# returns (773536, 19)

Since this is returning partially correct results, I suspect some row-group statistics are wrong, but then I would expect the calls from pyarrow to return incorrect results as well.
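One way to test that suspicion is to dump the statistics pyarrow sees for each row group. This is a sketch based on the repro above; "broken_nlrot.delta" and the column index are assumptions, adjust them to the actual table:

import glob
import pyarrow.parquet as pq

# Print per-row-group statistics for every parquet file in the table.
for path in sorted(glob.glob("broken_nlrot.delta/*.parquet")):
    md = pq.ParquetFile(path).metadata
    print(path, "->", md.num_row_groups, "row groups")
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        stats = rg.column(0).statistics  # column 0 stands in for "terminal"; pick the right index
        print(f"  row group {i}: rows={rg.num_rows}, stats={stats}")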

Hanspagh added the bug label on Feb 6, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Feb 6, 2024

@Hanspagh can you please check the following things:

  • try write_deltalake(engine='rust'), since this takes pyarrow out of the equation (a usage sketch follows below); also, please share the pyarrow version you use now
  • try deltalake v0.15.1 or v0.15.0

Also, it would really help if you could mimic the structure of the data with fake/sample data so we can try to reproduce it. The only logical explanation I can think of for now is that the partition expression is incorrect.
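For reference, the rust-engine suggestion is a single keyword on the existing call (a sketch, with df standing in for the reporter's frame):

from deltalake import write_deltalake

# engine="rust" bypasses the pyarrow-based writer entirely
write_deltalake("broken_nlrot.delta", df, mode="overwrite", engine="rust")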

@Hanspagh
Author

Hanspagh commented Feb 6, 2024 via email

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

Okay, so this seems to be related to pyarrow, since engine="rust" fixes it.
Currently I am using pyarrow 13.0.0; I will play around with the pyarrow/deltalake versions.

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

So I managed to reproduce this. It only happens with large datasets; 10_485_761 seems to be the magic number. I tried pyarrow 15, 13, 12, 10, and 9. With pyarrow 8 the process seems to hang when I try to save a frame this big.

It looks as if the filter overflows and only returns the rows beyond 10_485_760, since we get 1 row for 10_485_761 and 2 rows for 10_485_762.

10_485_760 is also exactly 10 * 1024**2 (1024**2 = 1_048_576).

I hope this helps to figure out what is going on here. Let me know if you want me to provide more details.

import pandas as pd
import pyarrow.compute as pc
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"data": ["B"] * 10_485_760})
write_deltalake("sample.delta", df, mode="overwrite")
dt_broken = DeltaTable("sample.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (10485760, 1)

df = pd.DataFrame({"data": ["B"] * 10_485_761 })
write_deltalake("sample.delta", df, mode="overwrite")
dt_broken = DeltaTable("sample.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (1, 1)

df = pd.DataFrame({"data": ["B"] * 10_485_762 })
write_deltalake("sample.delta", df, mode="overwrite")
dt_broken = DeltaTable("sample.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (2, 1)

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

I found the magic number: it comes from the default of max_rows_per_file in write_deltalake, which is set to 10 * 1024 * 1024. I suggest this be aligned with pyarrow, where the default is None/0.
Probably min_rows_per_group and max_rows_per_group should be aligned as well.

max_rows_per_file: int = 10 * 1024 * 1024,

The limit forces pyarrow to split the parquet output in two, and it seems deltalake then ignores all but the last of those split files.
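A quick sketch of the split this describes ("split.delta" is a throwaway path, not from the thread): one row past the default limit yields a second data file in the table directory.

import glob
import pandas as pd
from deltalake import write_deltalake

# 10 * 1024 * 1024 is the default max_rows_per_file; one extra row
# forces the pyarrow writer to emit a second (tiny) parquet file.
df = pd.DataFrame({"data": ["B"] * (10 * 1024 * 1024 + 1)})
write_deltalake("split.delta", df, mode="overwrite")
print(sorted(glob.glob("split.delta/*.parquet")))
# expect two data files, e.g. ...-0.parquet and ...-1.parquet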

@ion-elgreco
Collaborator

ion-elgreco commented Feb 6, 2024

@Hanspagh there seems to be an issue with the creation of the pyarrow.dataset when there are multiple parquet files. I can write tables with v0.15.2 and then read them with v0.15.1 using the pc.field("data") == "B" expression.

v0.15.2 gives this fragment expression:

0-e67ec940-2288-4e27-90a5-ba1a07745685-0.parquet ((data >= null[string]) and (data <= null[string]))

While v0.15.1 gave 0-e67ec940-2288-4e27-90a5-ba1a07745685-0.parquet None
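A sketch of how to inspect these expressions yourself, using the sample table from the repro above:

from deltalake import DeltaTable

# Each pyarrow fragment carries a guarantee ("partition") expression;
# under v0.15.2 the broken file shows the null-bounds expression above.
for fragment in DeltaTable("sample.delta").to_pyarrow_dataset().get_fragments():
    print(fragment.path, fragment.partition_expression)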

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

You are right, this is only a problem in 0.15.2.

Also, 0.15.2 seems to be printing some debugging info:

partition_values: {}
path: "0-14c7a12a-ee21-4fc6-b8a4-ea79c9a45c13-1.parquet"
partition_values: {}
path: "0-14c7a12a-ee21-4fc6-b8a4-ea79c9a45c13-0.parquet"

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

Hmm, but it does not seem to be strictly related to the number of files.

This works fine:

df = pd.DataFrame({"data": ["B"] * 10 })
write_deltalake("broken.delta", df, max_rows_per_file=2, max_rows_per_group=2, min_rows_per_group=2)
# 5 output files
dt_broken = DeltaTable("/Users/hans.pagh/Downloads/broken.delta")
dt_broken.to_pyarrow_dataset().to_table(filter=(pc.field("data") == "B")).shape
# (10,1)

@ion-elgreco
Collaborator

ion-elgreco commented Feb 6, 2024

@Hanspagh I see the issue: the stats are empty on the add action for one of the files. I will have to check why they are empty now and not before : )

Edit:
Actually, they were missing before as well; the difference is that now this results in a partition expression which thinks there are null values 😕
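To see the missing stats directly, one can read the add actions out of the transaction log. This is a sketch; the log file name assumes the first table version:

import json

# Every line in a Delta log entry is one action; add actions carry an
# optional "stats" JSON string, which comes back None for the broken file.
with open("sample.delta/_delta_log/00000000000000000000.json") as log:
    for line in log:
        action = json.loads(line)
        if "add" in action:
            print(action["add"]["path"], action["add"].get("stats"))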

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

This seems to be one of the smaller examples where it is broken:

df = pd.DataFrame({"data": ["B"] * 1024 * 33})
write_deltalake("broken.delta", df, max_rows_per_file=1024 * 32, max_rows_per_group=1024 * 16, min_rows_per_group=8 * 1024, mode="overwrite")

This seems fine, so it is something with max_rows_per_file:

df = pd.DataFrame({"data": ["B"] * 1024 * 33})
write_deltalake("broken.delta", df, max_rows_per_file=1024 * 31, max_rows_per_group=1024 * 16, min_rows_per_group=8 * 1024, mode="overwrite")

@ion-elgreco
Collaborator

@Hanspagh found the culprit: there is an empty row group in the parquet file. Our function get_file_stats_from_metadata checks whether stats are set for each row group, but in this case the last row group is empty and has no stats set, so it skips setting stats for the whole file.
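A sketch that makes the empty trailing row group visible with pyarrow's metadata API (paths taken from the repro above):

import glob
import pyarrow.parquet as pq

# The offending file ends with a row group of 0 rows whose column chunks
# report is_stats_set == False, which is what the stats gathering tripped on.
for path in sorted(glob.glob("sample.delta/*.parquet")):
    md = pq.ParquetFile(path).metadata
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(path, f"row group {i}: rows={rg.num_rows}, stats_set={rg.column(0).is_stats_set}")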

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

Great find, really great to see this could be solved so fast. :)

Unrelated to this, deltalake seems to create more row groups than pyarrow, whose row limit per group defaults to 1 million. Is there a specific reason for this?

@ion-elgreco
Collaborator

@Hanspagh you mean the rust engine?

@Hanspagh
Author

Hanspagh commented Feb 6, 2024

No, your default setting for max_rows_per_group is 128 * 1024, whereas for pyarrow it is 1024 * 1024; it was my understanding that having "too small" row groups is not ideal.

I also suspect that the min_rows_per_group default is the reason for the empty row group?

pyarrow defaults
https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html
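A sketch comparing the two sets of defaults ("defaults_delta" and "defaults_pa" are throwaway paths, not from the thread): the same million-row frame should come out with roughly eight row groups under deltalake's 128 * 1024 default versus one under pyarrow's 1024 * 1024.

import glob
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from deltalake import write_deltalake

df = pd.DataFrame({"data": ["B"] * (1024 * 1024)})

# deltalake default: max_rows_per_group = 128 * 1024
write_deltalake("defaults_delta", df, mode="overwrite")
# pyarrow default: max_rows_per_group = 1024 * 1024
ds.write_dataset(pa.Table.from_pandas(df), "defaults_pa", format="parquet")

for root in ("defaults_delta", "defaults_pa"):
    for path in glob.glob(f"{root}/*.parquet"):
        print(path, pq.ParquetFile(path).num_row_groups)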

@ion-elgreco
Collaborator

@Hanspagh not sure; I think they originated from the defaults Databricks uses with Spark Delta.

Fix is incoming btw

ion-elgreco added a commit that referenced this issue Feb 6, 2024
# Description
For some odd reason the pyarrow parquet writer will leave empty row
groups in the parquet file when it hits the max_rows_per_file limit
that's passed. While gathering the stats we were checking whether every
row group had stats set, but these empty row groups have no stats, which
caused the whole file's add action to get no stats recorded.

We now skip empty row groups while gathering the stats to prevent this.
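A minimal Python sketch of that guard (the actual fix lives in the Rust get_file_stats_from_metadata; gather_column_stats here is a hypothetical illustration, not the real signature):

# Hypothetical illustration of the fix: skip row groups with zero rows
# instead of treating their missing stats as "file has no stats".
def gather_column_stats(metadata, column_index):
    minimum, maximum, null_count = None, None, 0
    for i in range(metadata.num_row_groups):
        rg = metadata.row_group(i)
        if rg.num_rows == 0:
            continue  # empty trailing row group: no stats to read, skip it
        stats = rg.column(column_index).statistics
        if stats is None or not stats.has_min_max:
            return None  # genuinely missing stats still disqualify the file
        minimum = stats.min if minimum is None else min(minimum, stats.min)
        maximum = stats.max if maximum is None else max(maximum, stats.max)
        null_count += stats.null_count
    return minimum, maximum, null_count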

In v0.15.2 we now also evaluate files with no stats as null.
@roeap @rtyler, not sure if this is entirely correct as well.

# Related Issue(s)
- closes #2169

---------

Co-authored-by: Will Jones <[email protected]>