Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design of Serialization/Deserialization #2

Closed
JanKaul opened this issue Jul 21, 2023 · 7 comments
Closed

Design of Serialization/Deserialization #2

JanKaul opened this issue Jul 21, 2023 · 7 comments

Comments

@JanKaul
Copy link
Collaborator

JanKaul commented Jul 21, 2023

I would like to have a discussion about how to implement the Serialization/Deserialization of the iceberg metadata. There are currently two different designs each with its pros and cons. The designs are:

  1. Split representation into in_memory/on_disk part (example)
  2. Use custom implementations of Serialize/Deserialize (example)

I would like to get your opinion which design you prefer.

@ZENOTME
Copy link
Contributor

ZENOTME commented Jul 21, 2023

If I'm not mistaken, use the split representation also need to custom implementation for on_disk part. And we should maintain two system: in_memory::type, in_memory::value, on_disk::type, on_disk::value. And we also need to maintain the conversion between in_memory and on_disk. So I think strictly separate two part may cause redundant code. Can we only separate some part🤔?

@JanKaul
Copy link
Collaborator Author

JanKaul commented Jul 21, 2023

I also think that a split representation leads to redundancy which makes maintaining the code more difficult.

I personally think that the added complexity of custom implementations of Serialize/Deserialize is worth avoiding the redundancy.

@JanKaul
Copy link
Collaborator Author

JanKaul commented Jul 21, 2023

I realized that this discussion is closely related to #3 , because it gives additional motivation for splitting the representation into an in_memory and on_disk part.

I think @ZENOTME made a good point about splitting only a part of the spec.

We could split the representation for all structs that have user interaction (i.e. TableMetadata) but use a single representation for the smaller building blocks (i.e. types, values, ..)

@liurenjie1024
Copy link
Collaborator

Personally, I prefer the Serialize/Deserialize approach. From my experience, the effort doesn't reduce much since both way you need to maintain the conversion logic. The on_disk approach has some shorting comings:

  1. It's not easy to be comptaible with both v1 and v2 spec. See this pr, I have to remove v1 support since avro has strict schema check.
  2. Sometimes we still need to write the serializaer, see https://github.com/liurenjie1024/icelake/blob/eee3520f993837f4e1745c60538ba7ab1b421481/src/types/in_memory.rs#L64
    https://github.com/icelake-io/icelake/blob/eee3520f993837f4e1745c60538ba7ab1b421481/src/types/in_memory.rs#L64

@ZENOTME
Copy link
Contributor

ZENOTME commented Jul 21, 2023

We could split the representation for all structs that have user interaction (i.e. TableMetadata) but use a single representation for the smaller building blocks (i.e. types, values, ..)

Sounds good for me.

@Xuanwo
Copy link
Member

Xuanwo commented Jul 21, 2023

Ok, let's pick the Serialize/Deserialize approach.

We could split the representation for all structs that have user interaction (i.e. TableMetadata) but use a single representation for the smaller building blocks (i.e. types, values, ..)

Nice idea!

@liurenjie1024
Copy link
Collaborator

Close this as we have finished discussion. Feel free to open if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants