Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for storage table management of materialized views #21797

Open
mosabua opened this issue May 1, 2024 · 9 comments
Open

Support for storage table management of materialized views #21797

mosabua opened this issue May 1, 2024 · 9 comments
Assignees

Comments

@mosabua
Copy link
Member

mosabua commented May 1, 2024

Materialized views create storage tables transparently in Trino. However these tables are completely hidden in Trino and not accessible directly. As a result they can not be managed within Trino itself.

Use cases for this management include:

  • Improve performance by adding partitioning of the table
  • Run optimizes on the table or parts of the table
  • Remove snapshots of the table

Potential approaches:

  • Expose the storage table in Trino somehow and then allow usage of ALTER TABLE EXECUTE and similar commands
  • Create new command ALTER MATERIALIZED VIEW EXECUTE that allows performing tasks on the underlying storage table
@findepi
Copy link
Member

findepi commented May 6, 2024

However these tables are completely hidden in Trino and not accessible directly.

Yes. The fact that the storage exists is a necessity, but what the storage is -- this is abstracted.

Use cases for this management include:

  • Improve performance by adding partitioning of the table
  • Run optimizes on the table or parts of the table
  • Remove snapshots of the table

That would break the abstraction, tying us to the current implementation.

The necessary storage optimizations should be (and probably are) carried automatically by MV refreshes.
Partitioning is Iceberg MV concept and should be alterable via ALTER MATERIALIZED VIEW SET PROPERTIES.

@mosabua
Copy link
Member Author

mosabua commented May 6, 2024

Maybe closely moving it to storage table semantics breaks the abstraction, but that is NOT necessary. The abstraction can stay with the fact that it is a materialized view and not a view. That materializations automatically has needs for maintenance and the relevant tasks should be possible in Trino.

Currently users are forced to use the tooling from the underlying Iceberg system to perform this maintenance. That is a bad user experience and also a more brittle approach. The refresh does NOT currently perform necessary maintenance and users are asking for more control.

@osscm
Copy link
Contributor

osscm commented May 6, 2024

The refresh does NOT currently perform necessary maintenance and users are asking for more control.
yes, looks like current refresh does not support.

May be, keeping optimization separate than refresh can be handy, as users might want to only metadata optimization, instead of refreshing the MV.
As users currently does optimization separately, and even build tools/eco system to do table maintenance (typically uses Spark), so can schedule MV optimization using the same logistics.

@findepi
Copy link
Member

findepi commented May 6, 2024

The refresh does NOT currently perform necessary maintenance and users are asking for more control.

Currently, the refresh writes data anew (like CTAS). In Hive/Iceberg/Delta CTAS is supposed to write data in a form that's decent for querying. If this is not the case, all tables written by Trino are underperforming and we need to fix the CTAS/INSERT flow in the Iceberg connector.

@mosabua
Copy link
Member Author

mosabua commented May 6, 2024

Sure .. if we can improve the writing and inserting that would be good. At the same time we should allow maintenance on materialized views and not force users into some hacks with other tools on the underlying storage to work around problems.

@rstyp
Copy link
Contributor

rstyp commented May 6, 2024

Here are some points for consideration:

  • It is my understanding that during a refresh (with Iceberg), old data files are not physically deleted; rather, they are merely marked as deleted in the metadata snapshot. These files will persist in storage until an expire-snapshot operation is executed.
  • Over time, with each refresh, the metadata expands, optimizing metadata might be important as well.

@vgankidi
Copy link

vgankidi commented May 6, 2024

As we are adding support for incremental refresh, we may have to optimize the data files in the MV too. As @rstyp mentioned metadata compaction may help with planning times as well as the metadata grows. Currently no table maintenance operations can be done on the underlying iceberg tables of MV.

@findepi
Copy link
Member

findepi commented May 7, 2024

Isn't metadata compaction implicit in Iceberg?

@mosabua
Copy link
Member Author

mosabua commented May 23, 2024

Agreed to proceed with ALTER MATERIALIZED VIEW EXECUTE as approach in TCC. Suggestions is to follow the path of grammar PR, an SPI PR, and then a connector implementation PR. cc @dain @martint

https://github.com/trinodb/trino/wiki/Contributor-meetings#trino-contributor-call-23-may-2024

@osscm and team are going to start looking at working on this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

6 participants