Support partition value string deserialization for timestamp/binary #371
Conversation
I'm not sure the binary format test case and implementation are correct. According to the delta spec, binary data is encoded as a string of escaped binary values, for example "\u0001\u0002\u0003". I generated a delta table partitioned on a binary column with the values "\u0001" and "\u0002". In the _delta_log file, the partitionValues seem to be consistent with the documentation:
Then I tried to debug the new implementation with the following test case:
And watched the input parameter of Although
I'm not sure if I've missed something important here...
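For reference, here is a minimal sketch (not necessarily what this PR does) of turning such an escaped partition value into raw bytes after it has been read out of the _delta_log JSON; the helper name is made up, and it assumes every escaped code point fits in a single byte:

```rust
// Hypothetical helper, not part of this PR: converts a partition value such as
// "\u{0001}\u{0002}" (already unescaped by the JSON parser) into raw bytes.
// Assumes each code point is <= 0xFF so it maps to exactly one byte.
fn partition_value_to_bytes(value: &str) -> Vec<u8> {
    value.chars().map(|c| c as u8).collect()
}

fn main() {
    let decoded = partition_value_to_bytes("\u{0001}\u{0002}\u{0003}");
    assert_eq!(decoded, vec![1u8, 2, 3]);
}
```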
Thanks @zijie0, I will take a closer look at the PR later today or tomorrow. The timestamp handling part is going to be tricky due to ambiguities in the delta spec, see: delta-io/delta#643. The nanosecond type we write out is a logical timestamp type stored in int64, which is not supported by Spark. Spark only reads the nanosecond type stored in int96, which is deprecated in newer versions of the parquet format. The Rust parquet crate supports reading int96 timestamps, but not writing int96, as recommended by the parquet developers. So we need to look into this more to see what the best way to handle timestamps is.
Hmm... It reminds me of a similar issue we encountered some time ago: dask/fastparquet#646
Yeah, it's similar to the one you posted. Basically Spark writes out timestamps in the deprecated int96 type by default. When reading nanosecond timestamps, it only supports int96, not the newer logical nanosecond type (int64), which is also not recommended due to the limited precision of int64. So the best type to use going forward is the logical microsecond type (int64). This type is supported by Spark for both read and write. My hunch is that we should probably use microsecond timestamps here as well instead of nanoseconds.
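For concreteness, a quick sketch contrasting the two Arrow timestamp representations discussed here (the field name and nullability are just placeholders):

```rust
use arrow::datatypes::{DataType, Field, TimeUnit};

fn main() {
    // Logical microsecond timestamp stored in int64 -- readable and writable by Spark.
    let micros = Field::new("ts", DataType::Timestamp(TimeUnit::Microsecond, None), true);
    // Logical nanosecond timestamp stored in int64 -- not readable by Spark,
    // which only accepts nanoseconds in the deprecated int96 encoding.
    let nanos = Field::new("ts", DataType::Timestamp(TimeUnit::Nanosecond, None), true);
    println!("{:?} vs {:?}", micros.data_type(), nanos.data_type());
}
```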
It seems that in #375 we will change the time unit back to microseconds. Maybe I should update my PR after that one gets merged?
Yeah, I will send an email to the dev list later this week to get more clarification from the delta authors on the expected way to handle timestamp semantics.
Another (entry-level) question: when I'm debugging the binary format case, the call stack only shows that the error is thrown from
Async stack traces are unfortunately a pain point for me as well, so I don't really have a good solution either. I usually try running the code with
Sorry @zijie0 for the late reply, I finally got some time to take a closer look at the binary type test error you reported. I believe it's coming from https://github.com/apache/arrow-rs/blob/3fa6fcdbc09542343fb42a2b5b6c5bbe2e56c125/arrow/src/json/reader.rs#L1300. This means arrow-rs itself currently doesn't support reading a JSON string into the binary data type. It should be a fairly straightforward thing to add, though. So if you are interested in getting a commit into the Arrow Rust implementation, please let me know; I am more than happy to help guide you through the process. If not, I can add that support for you this weekend to unblock this PR.
This is because when we load the JSON file in Rust using serde_json, the string
Given that
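To illustrate the serde_json behavior described above (the key name and value here are made up): once the \uXXXX escapes are parsed, the Rust string already holds the raw control characters rather than the literal escape sequences.

```rust
use serde_json::Value;

fn main() {
    // Hypothetical partitionValues fragment; serde_json resolves the \uXXXX
    // escapes while parsing, so the resulting &str contains the real bytes.
    let v: Value = serde_json::from_str(r#"{"bin_col": "\u0001\u0002\u0003"}"#).unwrap();
    let s = v["bin_col"].as_str().unwrap();
    // Code points 0x01..0x03 each encode to a single UTF-8 byte.
    assert_eq!(s.as_bytes().to_vec(), vec![1u8, 2, 3]);
}
```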
@houqp Cool, let me take a look at the arrow-rs project first.
Hi @houqp, I tried to draft a commit supporting the binary type in arrow-rs. I'm not sure where I can find something like an "arrow spec", so I just read the source code to do the implementation. It seems that The commit is: zijie0/arrow-rs@926ea32
@zijie0 that looks good to me, please feel free to send an upstream arrow-rs PR :) The Arrow spec is documented at https://arrow.apache.org/docs/format/Columnar.html.
@houqp I've changed the timestamp type to microseconds. Please take a look when you get some time.
Thanks @zijie0, sorry I forgot to send the timestamp clarification email to the dev list. Will try to get it out this week.
Thank you @zijie0! We can tackle the timestamp precision problem later :)
Description
Support partition value string deserialization for timestamp/binary
Related Issue(s)
Documentation