Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Design of TableMetadataBuilder. #232

Open
liurenjie1024 opened this issue Mar 7, 2024 · 13 comments
Open

Discussion: Design of TableMetadataBuilder. #232

liurenjie1024 opened this issue Mar 7, 2024 · 13 comments

Comments

@liurenjie1024
Copy link
Collaborator

Problem statement

TableMetadataBuilder is useful in modifying/creating TableMetadata, and is a core data structure of transaction api. There are already some efforts to create one using derived builder, see #62 and #229. While it's easy to implement, I have some concerns about this approach. There are some problems with this approach:

  1. Not easy to review. it's quite easy to forget that we have changed builder api when we do modification to TableMetadata.
  2. Not easy to customize. I've checked the builder in java api, and it has a lot of customization in each method with validations.

The reason I want to be careful with TableMetadataBuilder is that it's a core data structure in iceberg, and we should not expose apis which be misused to construct inconsistent TableMetadata.

Proposal

I want to propose to implement TableMetadataBuilder in a similar approach like java api, rather than using derived builder directly, such as following:

pub struct TableMetadataBuilder(TableMetadata);

impl TableMetadataBuilder {
   pub fn add_schema(mut self, schema: Schema) -> Result<Self> {
       // Check if schema validity
   }

  pub fn add_unbound_partition_spec(mut self, spec: UnboundPartitionSpec) -> Result<Self> { 
   // Binding partition spec
  }

  pub fn build(self) -> Result<TableMetadata> {}
}
@liurenjie1024
Copy link
Collaborator Author

@Xuanwo
Copy link
Member

Xuanwo commented Mar 7, 2024

When add_schema has been called, both schemas and current_schema_id will be updated? The same question to partition_specs.

@liurenjie1024
Copy link
Collaborator Author

When add_schema has been called, both schemas and current_schema_id will be updated? The same question to partition_specs.

Yes, the add_schema process consists of several steps as in java implementation

  1. Check if already existed same schema.
  2. If not, assign schema id to it.
  3. Modify schemas.

Similar things happens for partition_specs.

Other methods such as setBranchSnapshot are also quite complicated.

@sdd
Copy link
Contributor

sdd commented Mar 7, 2024

I agree with @liurenjie1024. Looking at the Java implementation, it seems like there is too much going on here for a derived builder to be a good fit, and a hand-rolled builder would be more appropriate.

@liurenjie1024 liurenjie1024 changed the title Discuss: Design of TableMetadataBuilder. Discussion: Design of TableMetadataBuilder. Mar 7, 2024
@ZENOTME
Copy link
Contributor

ZENOTME commented Mar 8, 2024

I want to propose to implement TableMetadataBuilder in a similar approach like java api, rather than using derived builder directly

Implementing the builder manually looks good to me.

@liurenjie1024
Copy link
Collaborator Author

Cool, seems we reached consensus on this, let's move on to create TableMetadataBuilder.

@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Mar 10, 2024

what about supporting FormatVersion V1 and V2? Will we simply model V2 and provide all optional fields as well for V1 in order to simplify the struct TableMetadata (like it is right now?) - or would there be any reason to change that?

Also how would we handle the update or modifications to an existing TableMetadata. Using TableMetadataBuilder for the controlled construction and implement the update methods directly on the TableMetadata could be one approach?

@liurenjie1024
Copy link
Collaborator Author

what about supporting FormatVersion V1 and V2? Will we simply model V2 and provide all optional fields as well for V1 in order to simplify the struct TableMetadata (like it is right now?) - or would there be any reason to change that?

Sorry, I don't get your point. I think currently our approach is just like what you say, and we don't have plan to change it. Oh, you mean further modification to TableMetadata ? In fact, I mean the evolution of table metadata spec in future.

Also how would we handle the update or modifications to an existing TableMetadata. Using TableMetadataBuilder for the controlled construction and implement the update methods directly on the TableMetadata could be one approach?

That's possible, but that means we need to implement the validation logic in two places, and I think we should avoid such duplication.

@marvinlanhenke
Copy link
Contributor

[...] and we don't have plan to change it

That's possible, but that means we need to implement the validation logic in two places

Makes sense to me. thanks for clarifying.

@c-thiel
Copy link
Contributor

c-thiel commented Jul 23, 2024

@liurenjie1024, @Fokko do you know if someone is working on a full implementation of the TableMetadataBuilder?

The question is if we need it similar to java with all that's being implemented as part of transactions.

If we need it, and nobody is working on it, I would prepare a PR a cleaned-up Version of what we have in the Iceberg-Catalog over to here. It's mostly about adding all modifications as addition of snapshots, schema evolution, partitioning, ordering, various validations, partition binding, etc.

It would mainly contain the logic in this class:
https://github.com/apache/iceberg/blob/604b2bb85ccd33a132e0f588fad1611ab655e2c5/core/src/main/java/org/apache/iceberg/TableMetadata.java#L863
And a few associated classes.

@liurenjie1024
Copy link
Collaborator Author

@liurenjie1024, @Fokko do you know if someone is working on a full implementation of the TableMetadataBuilder?

The question is if we need it similar to java with all that's being implemented as part of transactions.

If we need it, and nobody is working on it, I would prepare a PR a cleaned-up Version of what we have in the Iceberg-Catalog over to here. It's mostly about adding all modifications as addition of snapshots, schema evolution, partitioning, ordering, various validations, partition binding, etc.

It would mainly contain the logic in this class: https://github.com/apache/iceberg/blob/604b2bb85ccd33a132e0f588fad1611ab655e2c5/core/src/main/java/org/apache/iceberg/TableMetadata.java#L863 And a few associated classes.

Hi, @c-thiel AFAIK, nobody is working on this. The plan was to add it one by one when we implement transaction api. It would be greate if you are interested in contributiong.

@c-thiel
Copy link
Contributor

c-thiel commented Jul 24, 2024

@liurenjie1024 I would be interested to contribute.

Let me maybe start with partition binding - i.e. binding an UnboundPartitionSpec to a schema - making it a bound PartitionSpec or re-binding to a different schema on updates.

Do you have a strong feeling of where the code should go? As part of an Action or separately? If not, I would just make a proposal.

@liurenjie1024
Copy link
Collaborator Author

liurenjie1024 commented Jul 24, 2024

@liurenjie1024 I would be interested to contribute.

Let me maybe start with partition binding - i.e. binding an UnboundPartitionSpec to a schema - making it a bound PartitionSpec or re-binding to a different schema on updates.

Do you have a strong feeling of where the code should go? As part of an Action or separately? If not, I would just make a proposal.

Hi, @c-thiel Thanks for your interest. Typical I think small prs would be easier to review and test. That's to say, a prefered approach would be to implement apis for TableMetadataBuilder in one pr, and add one Action with another pr.

Minor nit, putting all actions in one transaction.rs file maybe too large to maintain, and spltting transaction module into files will be easier to maintain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants