Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Prevent warning spam when reading tables generated by delta 1.2.1 #651

Merged
merged 1 commit into from
Jun 19, 2022
Merged

fix: Prevent warning spam when reading tables generated by delta 1.2.1 #651

merged 1 commit into from
Jun 19, 2022

Conversation

Tom-Newton
Copy link
Contributor

Description

delta-io/delta@6f39630 added a new numRecords field on remove actions. To prevent a large spam of warnings when reading tables with this new field delta-rs needs to be updated to accept this field.

I've just gone for the simplest possible fix which is to just ignore numRecords when reading. I had a go at making it read the field but being a novice I didn't really know what the impact of those changes would be.

Related Issue(s)

@Tom-Newton Tom-Newton marked this pull request as ready for review June 19, 2022 15:52
@wjones127
Copy link
Collaborator

I've just gone for the simplest possible fix which is to just ignore numRecords when reading.

That's sounds like acceptable behavior to me for now. Until we have a use for it, no need to read that field.

Do we know what the motivation was for adding it? I don't see any pull request or discussion associated with that commit 😞

@Tom-Newton
Copy link
Contributor Author

Do we know what the motivation was for adding it? I don't see any pull request or discussion associated with that commit disappointed

I was also wondering this... possibly @scottsand-db can enlighten us?

@Tom-Newton Tom-Newton changed the title Ignore new numRecords field on remove actions fix: Prevent warning spam when reading tables generated by delta 1.2.1 Jun 19, 2022
@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Jun 19, 2022

Thanks for reviewing @wjones127. Do you think we could get this into the new Python release that looks imminent #640 ?

@wjones127 wjones127 merged commit 82c1936 into delta-io:main Jun 19, 2022
@scottsand-db
Copy link

scottsand-db commented Jun 21, 2022

Hi @Tom-Newton , upon reviewing this further, I'm curious if you are actually seeing _delta_log files with the RemoveFile.numRecords field written out. In Delta Lake, this field is marked as @JsonIgnoreProperties(Array("numRecords")), so I'd expect is isn't written out?

edit: or parquet files.

Would appreciate if you could provide any more details on your end, thanks!

@Tom-Newton
Copy link
Contributor Author

Hi @Tom-Newton , upon reviewing this further, I'm curious if you are actually seeing _delta_log files with the RemoveFile.numRecords field written out. In Delta Lake, this field is marked as @JsonIgnoreProperties(Array("numRecords")), so I'd expect is isn't written out?

Would appreciate if you could provide any more details on your end, thanks!

I'm only seeing it in the checkpoints and it always seems to be null. For example if you run the script I wrote for reproducing it (using delta 1.2.1) then inspect a random remove action in the check point file:

Python 3.8.0 (default, Dec  9 2021, 17:53:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_parquet("/tmp/reproduce_deltalake_warnings/_delta_log/00000000000000000010.checkpoint.parquet")
>>> df.loc[0, "remove"]
{'path': 'part-00000-6e72cce7-1dde-4987-93de-37f39100412c-c000.snappy.parquet', 'deletionTimestamp': 1655833772300.0, 'dataChange': False, 'extendedFileMetadata': True, 'partitionValues': [], 'size': 588.0, 'tags': None, 'numRecords': None}
>>> 

The numRecords field is present and set to null.

@scottsand-db
Copy link

I've made a PR to fix this here: delta-io/delta#1230

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warning spam: Unexpected field name numRecords for remove action
3 participants