i32 limit in JSON stats #2646

Closed
alfredolainez opened this issue Jul 3, 2024 · 13 comments · Fixed by #2649
Labels
bug Something isn't working

Comments

@alfredolainez

Environment

Delta-rs version: 0.18.2

Binding: Python

Environment:

  • Cloud provider: AWS
  • OS: Amazon Linux 2

Bug

What happened:

When reading a DeltaLake table from Polars using pl.read_delta, I get the following error:

DeltaProtocolError: Invalid JSON in file stats: invalid value: integer 4051124561, expected i32 at line 1 column 70

which ultimately comes from deltalake:

[screenshot: deltalake stack trace showing the i32 parse error]

What you expected to happen: My Python code can successfully read other tables; it is just this particular table that throws this error. The table where this happens is frequently accessed, and I can read it successfully through Spark, so I was expecting to be able to read it from deltalake as well. I am not sure whether the int32 limitation is part of the protocol or whether the library should allow for bigger integer types.

@alfredolainez alfredolainez added the bug Something isn't working label Jul 3, 2024
@alfredolainez alfredolainez changed the title i32 i32 limit in JSON stats Jul 3, 2024
@sherlockbeard
Contributor

Can you share a reproducible example, or your JSON commit file (make sure there is no personal or secret information)?

The code below works fine:

import polars as pl

# all values exceed the i32 max (2147483647) and still round-trip fine
pl_df = pl.DataFrame({
    "x": [4051124561, 4051124564, 4051124563],
    "y": [4051124561, 4051124564, 4051124563],
    "z": [4051124561, 4051124564, 4051124563],
})

pl_df.write_delta('./temp2')

pl_df = pl.read_delta('./temp2')

print(pl_df)

@alfredolainez
Author

alfredolainez commented Jul 4, 2024

Thanks for your example @sherlockbeard, I can confirm that works fine for me as well.

I have been trying to find the offending JSON file in the _delta_log folder, but the table is massive and there are thousands of JSON commit files plus parquet checkpoint files. There are so many that I can't even list the folder easily to do some string matching on the offending integer. I can see the integer referenced in the error keeps changing and growing bigger, so hopefully I can limit the search to the _delta_log files that are read first. The error comes up very quickly after executing the query, so I guess it must happen in one of the first metadata reads, before the query starts executing. Which files does deltalake read first, so that I can limit my search to those? (My query is just a pl.read_delta referencing two partitions and a few columns.)

@sherlockbeard
Contributor

sherlockbeard commented Jul 4, 2024

A Delta table read starts from the last checkpoint and the JSON commits created after it.

There should be a _last_checkpoint file in the _delta_log folder. It contains data like {"size":6,"size_in_bytes":20619,"version":3}; in this case version 3 is my last checkpoint, and all the JSONs after it (from 00000000000000000004.json to the latest) are read on top.
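For example, a minimal sketch (assuming a local table directory temp4; on S3 you would fetch the object instead) that pulls the checkpoint version out of _last_checkpoint:

import json

# _last_checkpoint is a single-line JSON document in the table's _delta_log
with open("temp4/_delta_log/_last_checkpoint") as f:
    last_checkpoint = json.load(f)

# commits created after this version are the JSON files worth searching
print("last checkpoint version:", last_checkpoint["version"])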

After that you can enumerate the commit files like this:

import os

table_path = "temp4"
log_dir = "/_delta_log/"
json_start = 4        # first commit after the checkpoint version
total_files = 50

for i in range(json_start, json_start + total_files + 1):
    path = table_path + log_dir + str(i).zfill(20) + ".json"
    if not os.path.exists(path):
        continue
    # open the commit and search it for the offending value
    with open(path) as f:
        if "4051124561" in f.read():
            print("found it:", path)

@alfredolainez
Author

Big thanks Sherlock, found the culprit!

This is the JSON in _last_checkpoint

{'version': 435104,
 'size': 14292411,
 'parts': 286,
 'sizeInBytes': 4563187026,
 'numOfAddFiles': 9857205,
 'checkpointSchema': {big table schema},
 'checksum': '834a34e25536b983308deb6ce9bb8df4'}

sizeInBytes is the number I get as invalid. Seems the library might not be expecting this big of a checkpoint?
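For scale, a signed 32-bit integer maxes out at 2,147,483,647, so this value cannot fit; a quick check:

I32_MAX = 2**31 - 1          # 2147483647
print(4563187026 > I32_MAX)  # True: this sizeInBytes overflows an i32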

@sherlockbeard
Contributor

sherlockbeard commented Jul 4, 2024

Hey @alfredolainez, in the bug report the number is 4051124561.

I edited my _last_checkpoint size_in_bytes to 4563187026 and was still able to read it.

@alfredolainez
Author

The table is being written to frequently, so the number keeps changing. When I checked _last_checkpoint, the number in sizeInBytes was the one I now get in the error 🙂.

@alfredolainez
Author

alfredolainez commented Jul 4, 2024

Interesting. I copied my _last_checkpoint file to the location in your example (/temp2) and I am able to reproduce the error.

Copying just this as the _last_checkpoint file seems to do the trick to get the DeltaProtocolError:

{"version":435195,"size":14337557,"parts":287,"sizeInBytes":4579627068,"numOfAddFiles":9902351}\n

For reference, I am using deltalake==0.18.2
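A self-contained sketch of the repro (assuming a fresh local directory ./temp2 and deltalake==0.18.2; the error is raised while parsing the checkpoint metadata, before any data is read):

import polars as pl

# write a tiny table, then drop in the oversized _last_checkpoint from above
pl.DataFrame({"x": [1, 2, 3]}).write_delta("./temp2")

with open("./temp2/_delta_log/_last_checkpoint", "w") as f:
    f.write('{"version":435195,"size":14337557,"parts":287,'
            '"sizeInBytes":4579627068,"numOfAddFiles":9902351}\n')

# raises DeltaProtocolError: ... expected i32
pl.read_delta("./temp2")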

@sherlockbeard
Contributor

sherlockbeard commented Jul 4, 2024

Yep, using this I am able to reproduce.

Funny thing: I tried with
{"size":10,"size_in_bytes": 4563187026,"version":7}
and it worked fine. Can you try the same in your /temp2 table?

OK, got the reason. Just to confirm: does your table have deletion vectors?

@alfredolainez
Author

Using that resulted in a different error for me; I think my table is a bit different from yours. Seeing this issue (#1468), it seems that sizeInBytes is optional, so probably that's why.

I am not sure whether the table has deletion vectors, but I would imagine so, since storage performance is critical here. Is there an easy way to check?

@sherlockbeard
Contributor

sherlockbeard commented Jul 4, 2024

@rtyler, @ion-elgreco: fixing this will require a change in the Delta protocol's DeletionVectorDescriptor, moving the field sizeInBytes from int to long.

@sherlockbeard
Contributor

Now I am a little confused, because delta-rs doesn't support reading with deletion vectors.
@alfredolainez, what are you using to write the table? Also, were you previously able to read this table from Polars?

@alfredolainez
Author

I don't have many details on how the table is written, but as far as I understand it is Spark. This is the first time I am trying to read these tables with Polars, so I am not sure whether it used to work before. However, I can read other tables in the same lake successfully, and I can see that _last_checkpoint in those tables also contains the same sizeInBytes field, just under the i32 limit. The tables do seem to have deletion vectors in the checkpointSchema field, but I can read them fine, except for this table, where the top-level sizeInBytes doesn't fit in an i32.

Deletion vectors might be unrelated though, no? Our toy example shows the problem without them.

@sherlockbeard
Contributor

Created a PR that fixes the dummy example.
I'd suggest you hit the #delta-rs channel on Slack (https://go.delta.io/slack); your case is very strange.

rtyler pushed a commit that referenced this issue Jul 6, 2024