Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#3365] feat(spark-connector): Support Iceberg metadata tables #3481

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

caican00
Copy link
Collaborator

@caican00 caican00 commented May 21, 2024

What changes were proposed in this pull request?

Support Iceberg metadata tables, such as:

  ENTRIES,
  FILES,
  DATA_FILES,
  DELETE_FILES,
  HISTORY,
  METADATA_LOG_ENTRIES,
  SNAPSHOTS,
  REFS,
  MANIFESTS,
  PARTITIONS,
  ALL_DATA_FILES,
  ALL_DELETE_FILES,
  ALL_FILES,
  ALL_MANIFESTS,
  ALL_ENTRIES,
  POSITION_DELETES

Why are the changes needed?

Support Iceberg metadata tables.

Fix: #3365

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New IT and modified ITs.

@caican00
Copy link
Collaborator Author

supported read Iceberg metadata tables, could you please help review it if you are free?
cc: @FANNG1

@@ -351,7 +351,8 @@ public static TableIdentifier buildIcebergTableIdentifier(
*/
public static TableIdentifier buildIcebergTableIdentifier(NameIdentifier nameIdentifier) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why changing this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you explain why changing this?

see #3481 (comment).
the namespace maybe be concacted by db and table, so here we must to split it.

return getCatalogDefaultNamespace();
} else if (sparkIdentifier.namespace().length == 1) {
return sparkIdentifier.namespace()[0];
} else if (sparkIdentifier.namespace().length == 2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what conditions when length == 2? could you add description ?

Copy link
Collaborator Author

@caican00 caican00 May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for an Iceberg metadata table, such as iceberg.default_db.test_iceberg_metadata_table.snapshots, the sparkIdentifier.namespace() should be ['default_db', 'test_iceberg_metadata_table'], and there is a limitation in RelationalCatalog, so I decide to concact the sparkIdentifier.namespace() here to avoid exception.

https://github.com/datastrato/gravitino/blob/ce29d83880a68c34f9aa25a8ced103c4ff9f118c/api/src/main/java/com/datastrato/gravitino/Namespace.java#L165-L170

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this method, the Namespace will contain metalake, catalog and db.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for iceberg.default_db.test_iceberg_metadata_table.snapshots, getDatabase will return default_db.test_iceberg_metadata_table? why not default_db?

Copy link
Collaborator Author

@caican00 caican00 May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for iceberg.default_db.test_iceberg_metadata_table.snapshots, getDatabase will return default_db.test_iceberg_metadata_table?

Yes.

why not default_db?

for an iceberg metadata table, such as iceberg.default_db.test_iceberg_metadata_table.snapshots, the real table name is snapshots. Spark Identifier automatically uses the last part as a tableName, and everything else except the catalog and tableName as a namespace

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for iceberg.default_db.test_iceberg_metadata_table.snapshots, getDatabase will return default_db.test_iceberg_metadata_table?

Yes.

why not default_db?

for an iceberg metadata table, such as iceberg.default_db.test_iceberg_metadata_table.snapshots, the real table name is snapshots. Spark Identifier automatically uses the last part as a tableName, and everything else except the catalog and tableName as a namespace

cc: @FANNG1

@caican00
Copy link
Collaborator Author

comments have been addressed and could you please help review again? @FANNG1

@caican00 caican00 requested a review from FANNG1 May 22, 2024 01:49
@pan3793
Copy link
Member

pan3793 commented May 22, 2024

I have a question about the permission management of the metadata table, though it might be out of the scope of this PR.

Suppose there is an Iceberg table iceberg_ctlg.my_db.my_table, when role X is granted permission to the table, does it mean X is granted the same permission for all metatables?

@FANNG1
Copy link
Contributor

FANNG1 commented May 22, 2024

The current implementation retrivies metadata table (like Iceberg snapshot table) from Gravitino, I'm afraid whether this's out of the scope of Gravitino and Whether Gravitino is ready to provide the corresponding abilities (like we have to combine database and table as gravitino namespace because Gravitino only support one level namespace). Currently how about querying metadata table from underlying Iceberg catalog not though Gravitino, but this may skip the privilege check, WDYT? @jerryshao @caican00 @qqqttt123

@FANNG1
Copy link
Contributor

FANNG1 commented May 22, 2024

I have a question about the permission management of the metadata table, though it might be out of the scope of this PR.

Suppose there is an Iceberg table iceberg_ctlg.my_db.my_table, when role X is granted permission to the table, does it mean X is granted the same permission for all metatables?

The permission solution of spark connector had not been considered for now. From the first eye, seems we should keep the same permission for metatables. Kyuubi may encounter similar problems, @pan3793 could you share your thought? cc @qqqttt123 @jerryshao

@pan3793
Copy link
Member

pan3793 commented May 22, 2024

There was an argument in the Kyuubi community about that. There are two opinions:

  1. metadata tables are independent resources, should be granted permissions explicitly
  2. metadata tables should inherit permissions from the data table

We have no conclusion yet. Just want to follow the common practices.

@caican00
Copy link
Collaborator Author

The current implementation retrivies metadata table (like Iceberg snapshot table) from Gravitino, I'm afraid whether this's out of the scope of Gravitino and Whether Gravitino is ready to provide the corresponding abilities (like we have to combine database and table as gravitino namespace because Gravitino only support one level namespace). Currently how about querying metadata table from underlying Iceberg catalog not though Gravitino, but this may skip the privilege check, WDYT? @jerryshao @caican00 @qqqttt123

it seems not only skipping privilege check, but audit will also be skipped if the audit ability is supported. cc @FANNG1

@caican00
Copy link
Collaborator Author

I have a question about the permission management of the metadata table, though it might be out of the scope of this PR.

Suppose there is an Iceberg table iceberg_ctlg.my_db.my_table, when role X is granted permission to the table, does it mean X is granted the same permission for all metatables?

FYI: Iceberg metadata tables are created implicitly, and it seems more reasonable to inherit the permissions of data tables directly.

@qqqttt123
Copy link
Contributor

We have a load table operation for rest api. If someone has read or write table privilege, he can load the table, too.

@jerryshao
Copy link
Contributor

jerryshao commented May 22, 2024

I would suggest we think of a thorough and elegant solution in namespace and nameIdentifier to handle this scenario, we don't want a quick and dirty solution, it would be better to see how to adjust the API to support this scenario. Besides, all the privileges/audits and others should be controlled by Gravitino, so we should not bypass Gravitino.

@caican00
Copy link
Collaborator Author

caican00 commented May 23, 2024

I would suggest we think of a thorough and elegant solution in namespace and nameIdentifier to handle this scenario, we don't want a quick and dirty solution, it would be better to see how to adjust the API to support this scenario. Besides, all the privileges/audits and others should be controlled by Gravitino, so we should not bypass Gravitino.

got it.

@FANNG1
Copy link
Contributor

FANNG1 commented Jun 4, 2024

@caican00 do you like to continue the work since #3696 is merged

@caican00
Copy link
Collaborator Author

caican00 commented Jun 5, 2024

@caican00 do you like to continue the work since #3696 is merged

@FANNG1 ok, i will continue to work on this.

@caican00 caican00 marked this pull request as draft August 12, 2024 07:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask] Support Iceberg metadata tables
5 participants