diff --git a/docs/manage-table-partition-using-gravitino.md b/docs/manage-table-partition-using-gravitino.md new file mode 100644 index 00000000000..07d4b70a298 --- /dev/null +++ b/docs/manage-table-partition-using-gravitino.md @@ -0,0 +1,379 @@ +--- +title: "Manage table partition using Gravitino" +slug: /manage-table-partition-using-gravitino +date: 2024-02-03 +keyword: table partition management +license: Copyright 2024 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +## Introduction + +Although many catalogs manage partitions automatically, some scenarios require manual partition management, such as managing the TTL (Time-To-Live) of partition data, gathering statistics on partition metadata, and optimizing queries through partition pruning. For these reasons, Gravitino provides partition management capabilities. + +### Requirements and limitations + +- Partition management applies only to partitioned tables, so make sure that the table you operate on is partitioned.
+ +The following table shows the partition operations supported across various catalogs in Gravitino: + +| Operation | Hive catalog | Iceberg catalog | Jdbc-Mysql catalog | Jdbc-PostgreSQL catalog | +|-----------------------|--------------------------------------------------------------------|--------------------------------------------------------------------|--------------------|-------------------------| +| Add Partition | YES | NO | NO | NO | +| Get Partition by Name | YES | NO | NO | NO | +| List Partition Names | YES | NO | NO | NO | +| List Partitions | YES | NO | NO | NO | +| Drop Partition | [Coming Soon](https://github.com/datastrato/gravitino/issues/1655) | [Coming Soon](https://github.com/datastrato/gravitino/issues/1655) | NO | NO | + +:::tip[WELCOME FEEDBACK] +If you need additional partition management support for a specific catalog, please feel free to [create an issue](https://github.com/datastrato/gravitino/issues/new/choose) on the [Gravitino repository](https://github.com/datastrato/gravitino). +::: + +## Partition operations + +### Add partition + +You must match the partition types you want to add with the table's [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning) types; Gravitino currently supports adding the following partition types: + +| Partition Type | Description | +|----------------|------------------------------------------------------------------------------------------------------------------------------------------------| +| identity | An identity partition represents a result of identity [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). | +| range | A range partition represents a result of range [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). | +| list | A list partition represents a result of list [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). 
| + +For JSON examples: + + + + +```json +{ + "type": "identity", + "name": "dt=2008-08-08/country=us", + "fieldNames": [ + [ + "dt" + ], + [ + "country" + ] + ], + "values": [ + { + "type": "literal", + "dataType": "date", + "value": "2008-08-08" + }, + { + "type": "literal", + "dataType": "string", + "value": "us" + } + ] +} +``` + +:::note +The entries in `values` must be in the same order as the entries in `fieldNames`. + +When adding an identity partition to a partitioned Hive table, the specified partition name is ignored. This is because Hive generates the partition name based on field names and values. +::: + + + + +```json +{ + "type": "range", + "name": "p20200321", + "upper": { + "type": "literal", + "dataType": "date", + "value": "2020-03-21" + }, + "lower": { + "type": "literal", + "dataType": "null", + "value": "null" + } +} +``` + + + + +```json +{ + "type": "list", + "name": "p202204_California", + "lists": [ + [ + { + "type": "literal", + "dataType": "date", + "value": "2022-04-01" + }, + { + "type": "literal", + "dataType": "string", + "value": "Los Angeles" + } + ], + [ + { + "type": "literal", + "dataType": "date", + "value": "2022-04-01" + }, + { + "type": "literal", + "dataType": "string", + "value": "San Francisco" + } + ] + ] +} +``` + +:::note +Each list in `lists` must have the same length. The values in each list must correspond to the field definitions in the list [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). +::: + + + + +For Java examples: + + + + +```java +Partition partition = + Partitions.identity( + "dt=2008-08-08/country=us", + new String[][] {{"dt"}, {"country"}}, + new Literal[] { + Literals.dateLiteral(LocalDate.parse("2008-08-08")), Literals.stringLiteral("us") + }, + Maps.newHashMap()); +``` + +:::note +The values are in the same order as the field names. + +When adding an identity partition to a partitioned Hive table, the specified partition name is ignored.
This is because Hive generates the partition name based on field names and values. +::: + + + + +```java +Partition partition = + Partitions.range( + "p20200321", + Literals.dateLiteral(LocalDate.parse("2020-03-21")), + Literals.NULL, + Maps.newHashMap()); +``` + + + + + +```java +Partition partition = + Partitions.list( + "p202204_California", + new Literal[][] { + { + Literals.dateLiteral(LocalDate.parse("2022-04-01")), + Literals.stringLiteral("Los Angeles") + }, + { + Literals.dateLiteral(LocalDate.parse("2022-04-01")), + Literals.stringLiteral("San Francisco") + } + }, + Maps.newHashMap()); +``` + +:::note +Each list in the lists must have the same length. The values in each list must correspond to the field definitions in the list [partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning). +::: + + + + +You can add a partition to a partitioned table by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions` endpoint or by using the Gravitino Java client. +The following is an example of adding an identity partition to a partitioned Hive table: + + + + +```shell +curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" -d '{ + "partitions": [ + { + "type": "identity", + "fieldNames": [ + [ + "dt" + ], + [ + "country" + ] + ], + "values": [ + { + "type": "literal", + "dataType": "date", + "value": "2008-08-08" + }, + { + "type": "literal", + "dataType": "string", + "value": "us" + } + ] + } + ] +}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://127.0.0.1:8090") + .build(); + +// Assume that you have a partitioned table named "metalake.catalog.schema.table".
+Partition addedPartition = + gravitinoClient + .loadMetalake(NameIdentifier.of("metalake")) + .loadCatalog(NameIdentifier.of("metalake", "catalog")) + .asTableCatalog() + .loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table")) + .supportPartitions() + .addPartition( + Partitions.identity( + new String[][] {{"dt"}, {"country"}}, + new Literal[] { + Literals.dateLiteral(LocalDate.parse("2008-08-08")), Literals.stringLiteral("us")}, + Maps.newHashMap())); +``` + + + + +### Get a partition by name + +You can get a partition by its name by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions/{partition_name}` endpoint or by using the Gravitino Java client. +The following is an example of getting a partition by its name: + + + + +```shell +curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" \ +http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions/p20200321 +``` + +:::tip +If the partition name contains special characters, you should use [URL encoding](https://en.wikipedia.org/wiki/Percent-encoding#Reserved_characters). For example, if the partition name is `dt=2008-08-08/country=us`, you should use `dt%3D2008-08-08%2Fcountry%3Dus` in the URL. +::: + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://127.0.0.1:8090") + .build(); + +// Assume that you have a partitioned table named "metalake.catalog.schema.table".
+Partition partition = + gravitinoClient + .loadMetalake(NameIdentifier.of("metalake")) + .loadCatalog(NameIdentifier.of("metalake", "catalog")) + .asTableCatalog() + .loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table")) + .supportPartitions() + .getPartition("partition_name"); +``` + + + + +### List partition names under a partitioned table + +You can list all partition names under a partitioned table by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions` endpoint or by using the Gravitino Java client. +The following is an example of listing all partition names under a partitioned table: + + + + +```shell +curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" \ +http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions +``` + + + + +```java +GravitinoClient gravitinoClient = GravitinoClient + .builder("http://127.0.0.1:8090") + .build(); + +// Assume that you have a partitioned table named "metalake.catalog.schema.table". +String[] partitionNames = + gravitinoClient + .loadMetalake(NameIdentifier.of("metalake")) + .loadCatalog(NameIdentifier.of("metalake", "catalog")) + .asTableCatalog() + .loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table")) + .supportPartitions() + .listPartitionNames(); +``` + + + + +### List partitions under a partitioned table + +If you want more detailed information about the partitions under a partitioned table, you can list all of its partitions by sending a `GET` request with the `details=true` query parameter to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/tables/{partitioned_table_name}/partitions` endpoint or by using the Gravitino Java client.
+The following is an example of listing all partitions under a partitioned table: + + + + +```shell +curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \ +-H "Content-Type: application/json" \ +"http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/tables/table/partitions?details=true" +``` + + + + +```java +// Assume that you have a partitioned table named "metalake.catalog.schema.table". +Partition[] partitions = + gravitinoClient + .loadMetalake(NameIdentifier.of("metalake")) + .loadCatalog(NameIdentifier.of("metalake", "catalog")) + .asTableCatalog() + .loadTable(NameIdentifier.of("metalake", "catalog", "schema", "table")) + .supportPartitions() + .listPartitions(); +``` + + + \ No newline at end of file diff --git a/docs/table-partitioning-bucketing-sort-order-indexes.md b/docs/table-partitioning-bucketing-sort-order-indexes.md index c43a76eb21c..ee1aa82cfec 100644 --- a/docs/table-partitioning-bucketing-sort-order-indexes.md +++ b/docs/table-partitioning-bucketing-sort-order-indexes.md @@ -43,6 +43,8 @@ For function partitioning, you should provide the function name and the function - In some cases, you require other information. For example, if the partitioning strategy is `bucket`, you should provide the number of buckets; if the partitioning strategy is `truncate`, you should provide the width of the truncate. +Once a partitioned table is created, you can [manage its partitions using Gravitino](./manage-table-partition-using-gravitino.md). + ## Table bucketing To create a bucketed table, you should use the following three components to construct a valid bucketed table.